Presentation


A series of important applications of combinatorics on words has emerged with 
the development of computerized text and string processing, especially in bi- 
ology and in linguistics. The aim of this volume is to present, in a unified 
treatment, some of the major fields of applications. The main topics that are 
covered in this book are 

1. Algorithms for manipulating text, such as string searching, pattern match- 
ing, and testing a word for special properties. 

2. Efficient data structures for retrieving information on large indexes, in-
cluding suffix trees and suffix automata.

3. Combinatorial, probabilistic and statistical properties of patterns in finite
words, and more general patterns, under various assumptions on the sources
of the text.

4. Inference of regular expressions. 

5. Algorithms for repetitions in strings, such as maximal runs or tandem
repeats.

6. Linguistic text processing, especially analysis of the syntactic and semantic 
structure of natural language. Applications to language processing with 
large dictionaries. 

7. Enumeration, generation and sampling of complex combinatorial struc- 
tures by their encodings in words. 

This book is actually the third of a series of books on combinatorics on 
words. Lothaire’s “Combinatorics on Words” appeared in its first printing in 
1984 as Volume 17 of the Encyclopedia of Mathematics. It was based on the 
impulse of M. P. Schützenberger’s scientific work. Since then, the theory has devel-
oped into a large scientific domain. It was reprinted in 1997 in the Cambridge
Mathematical Library. Lothaire is a nom de plume for a group of authors ini-
tially constituted of former students of Schützenberger. Over the years, it has
enlarged to a broader community coordinated by the editors. A second volume
of Lothaire’s series, entitled “Algebraic Combinatorics on Words”, appeared in
2002. It contains both complements and new developments that have emerged since
the publication of the first volume.

The content of this volume is quite applied, in comparison with the two 
previous ones. However, we have tried to follow the same spirit, namely to 
present introductory expositions, with full descriptions and numerous examples. 
Refinements are frequently deferred to problems, or mentioned in Notes. There
is presently no similar book that covers these topics in this way.

Although each chapter has a different author, the book is really a cooper- 
ative work. A set of common notation has been agreed upon. Algorithms are 
presented in a consistent way using transparent conventions. There is also a 
common general index, and a common list of bibliographic references. 

This book is independent of Lothaire’s other books, in the sense that no 
knowledge of the other volumes is assumed. 

The book has been written with the objective of being readable by a large 
audience. The prerequisites are those of a general scientific education. Some
chapters may require a more advanced preparation. A graduate student in 
science or engineering should have no difficulty in reading all the chapters. A 
student in linguistics should be able to read part of it with profit and interest. 


Outline of contents. 
The general organisation is described below.


Core algorithms: Algorithms on words; Structures for indexes
Natural languages: Symbolic language processing; Statistical language processing
Bioinformatics: Inference of network expressions; Statistics on words with applications
Algorithmics: Analytic approach to pattern matching; Periodic structures in words
Mathematics: Counting, coding and sampling; Words in number theory


Figure 0.1. Overall structure of “Applied Combinatorics on Words”.


The first two chapters are devoted to core algorithms. The first, “Algorithms
on words”, is quite general, and is used in all other chapters. The second
chapter, “Structures for indexes”, is fundamental for all advanced algorithmic 
treatment, and more technical. 

Among the applications, a first domain is linguistics, represented by two
chapters entitled “Symbolic language processing” and “Statistical language pro-
cessing”.

A second application is biology. This is covered by two chapters, entitled
“Inference of network expressions” and “Statistics on Words with Applications
to Biological Sequences”.

The next block is composed of two chapters dealing with algorithmics, a
subject which is of interest in its own right in theoretical computer science, but is also
related to biology and linguistics. One chapter is entitled “Analytic approach
to pattern matching” and deals with generalized pattern matching algorithms.
A chapter entitled “Periodic structures in words” describes algorithms used for 
discovering and enumerating repetitions in words. 

A final block is devoted to applications to mathematics (and theoretical 
physics). It is represented by two chapters. The first chapter, entitled “Count- 
ing, coding and sampling with words” deals with the use of words for coding 
combinatorial structures. Another chapter, entitled “Words in number theory” 
deals with transcendence, fractals and dynamical systems. 


Description of contents. 

Basic algorithms, as needed later, and notation are given in Chapter 1 “Algo- 
rithms on words”, written by Jean Berstel and Dominique Perrin. This chapter 
also contains basic concepts on automata, grammars, and parsing. It ends with
an exposition of probability distributions on words. The concepts and methods
introduced are used in all the other chapters. 

Chapter 2, entitled “Structures for indexes” and written by Maxime Croche- 
more, presents data structures for the compact representation of the suffixes of 
a text. These are used in several subsequent chapters. Compact suffix trees are 
presented, and construction of these trees in linear time is carefully described. 
The theory and algorithmics for suffix automata are presented next. The main
application, namely the construction of indexes, is then described. Many other
applications are given, such as detection of repetitions or forbidden words in a 
text, use as a pattern matching machine, and search for conjugates. 


The first domain of applications, linguistics, is represented by Chapter 3 and 
Chapter 4. Chapter 3, entitled “Symbolic language processing” is written by 
Eric Laporte. In language processing, a text or a discourse is a sequence of 
sentences; a sentence is a sequence of words; a word is a sequence of letters. 
The most universal levels are those of sentence, word and letter (or phoneme), 
but intermediate levels exist, and can be crucial in some languages, between 
word and letter: a level of morphological elements (e.g. suffixes), and the level 
of syllables. The discovery of this piling up of levels, and in particular of word 
level and phoneme level, delighted structuralist linguists in the 20th century. 
They termed this inherent, universal feature of human language “double
articulation”.

This chapter is organized around the main levels of any language modelling: 
first, how words are made from letters; second, how sentences are made from 
words. It surveys the basic operations of interest for language processing, and 
for each type of operation, it examines the formal notions and tools involved. 
The main originality of this presentation is the systematic and consistent use
of finite state automata at every level of the description. This point of view
is reflected in some practical implementations of natural language processing 
systems. 

Chapter 4, entitled “Statistical language processing” is written by Mehryar 
Mohri. It presents the use of statistical methods in natural language processing.
The main tool developed is the notion of weighted transducers. The weights are 
numbers in some semiring that can represent probabilities. Applications to 
speech processing are discussed. 


The block of applications to biology is concerned with analysis of word oc-
currences, pattern matching, and connections with genome analysis. It is covered
by the next two chapters, and to some extent also by the algorithmics block.

Chapter 5, “Inference of network expressions”, is written by Nadia Pisanti 
and Marie-France Sagot. This chapter introduces various mathematical models
and algorithms for inferring regular expressions without Kleene star that appear
repeated in a word or are common to a set of words. Inferring a network
expression means to discover such expressions which are initially unknown, from 
the word(s) where the repeated (or common) expressions will be sought. This 
is in contrast with the string searching problem considered in other chapters. 
This has many applications, notably in molecular biology, system security, text
mining etc. Because of the richness of the mathematical and algorithmic
problems posed by molecular biology, we concentrate on applications in this
area. Applications to biology motivate us also to consider network expressions 
that appear repeated not exactly but approximately. 

Chapter 6 is written by Gesine Reinert, Sophie Schbath and Michael Wa- 
terman, and entitled “Statistics on Words with Applications to Biological Se- 
quences”. Properties of words in sequences have been of considerable interest 
in many fields, such as coding theory and reliability theory, and most recently 
in the analysis of biological sequences. The latter will serve as the key example 
in this chapter. 

Two main aspects of word occurrences in biological sequences are: where 
do they occur and how many times do they occur? An important problem, for 
instance, was to determine the statistical significance of a word frequency in a 
DNA sequence. The naive idea is the following: a word may be significantly rare 
in a DNA sequence because it disrupts replication or gene expression (perhaps
a negative selection factor), whereas a significantly frequent word may have a
fundamental activity with regard to genome stability. Well-known examples
of words with exceptional frequencies in DNA sequences are certain biological
palindromes corresponding to restriction sites avoided for instance in E. coli,
and the Cross-over Hotspot Instigator sites in several bacteria. 

Statistical methods to study the distribution of the word locations along a 
sequence and word frequencies have also been an active field of research; the 
goal of this chapter is to provide an overview of the state of this research. 

Because DNA sequences are long, asymptotic distributions were proposed 
first. Exact distributions exist now, motivated by the analysis of genes and 
protein sequences. Unfortunately, exact results are not practical for
long sequences because of heavy numerical calculation, but they allow the user to
assess the quality of the stochastic approximations when no approximation error 
can be provided. For example, BLAST is probably the best-known algorithm 
for DNA matching, and it relies on a Poisson approximation. This is another 
motivation for the statistical analysis given in this chapter. 


The algorithmics block is composed of two chapters. In Chapter 7, entitled 
“Analytic approach to pattern matching”, and written by Philippe Jacquet 
and Wojciech Szpankowski, pattern matching is considered for various types 
of patterns, and for various types of sources. Single patterns, sequences of 
patterns, and sequences of patterns with separation conditions are considered.
The sources are Bernoulli and Markov, and also more general sources arising 
from dynamical systems. The derivation of the equations is heavily based on 
combinatorics on words and formal languages. 

Chapter 9, written by Roman Kolpakov and Gregory Koucherov and entitled 
“Periodic structures in words”, deals with the algorithmic problem of detect-
ing, counting and enumerating repetitions in a word. The interest for this is in
text processing, compression and genome analysis, where tandem repeats may
have a particular significance. Linear time algorithms exist for detecting tan-
dem repeats, but since there may be quadratically many repetitions, maximal
repetitions or “runs” are of importance, and are considered in this chapter.


A final block is concerned with applications to mathematics. Chapter 8, 
written by Dominique Poulalhon and Gilles Schaeffer, is entitled “Counting, 
coding and sampling with words”. Its aim is to give typical descriptions of 
the interaction of combinatorics on words with the treatments of combinatorial 
structures. The chapter is focused on three aspects of enumeration: counting el- 
ements of a family according to their size, generating them uniformly at random, 
and coding them as compactly as possible by binary words. These aspects are 
respectively illustrated on examples taken from classical combinatorics (walks 
on lattices), from statistical physics (convex polyominoes and directed animals), 
and from graph algorithmics (planar maps). The rationale of the chapter is that 
nice enumerative properties are the visible traces of structural properties, and 
that making the latter explicit in terms of words of simple languages is a way
to solve simultaneously and simply the three problems above. 

Chapter 10 is written by Jean-Paul Allouche and Valérie Berthé. It is enti- 
tled “Words in number theory”. This chapter is concerned with the intercon- 
nection between combinatorial properties of infinite words, such as repetitions, 
and transcendental numbers. A second part considers a famous infinite word, 
called the Tribonacci word, to investigate and illustrate connections between 
combinatorics on words and dynamical systems, quasicrystals, the Rauzy frac- 
tal, rotation on the torus, etc. Relations to the cut and project method are 
described, and an application to simultaneous approximation is given.


1.0. Introduction


This chapter is an introductory chapter to the book. It gives general notions, 
notation and technical background. It covers, in a tutorial style, the main 
notions in use in algorithms on words. In this sense, it is a comprehensive 
exposition of basic elements concerning algorithms on words, automata and 
transducers, and probability on words. 

The general goal of “stringology” we pursue here is to manipulate strings of 
symbols, to compare them, to count them, to check some properties and perform 
simple transformations in an effective and efficient way. 

A typical illustrative example of our approach is the action of circular per- 
mutations on words, because several of the aspects we mentioned above are 
present in this example. First, the operation of circular shift is a transduction 
which can be realized by a transducer. We include in this chapter a section (Sec- 
tion 1.5) on transducers. Transducers will be used in Chapter 3. The orbits of 
the transformation induced by the circular permutation are the so-called con- 
jugacy classes. Conjugacy classes are a basic notion in combinatorics on words. 
The minimal element in a conjugacy class is a good representative of a class. It 
can be computed by an efficient algorithm (actually in linear time). This is one 
of the algorithms which appear in Section 1.2. Algorithms for conjugacy are 
again considered in Chapter 2. These words give rise to Lyndon words which 
have remarkable combinatorial properties already emphasized in Lothaire 1997. 
We describe in Section 1.2.5 the Lyndon factorization algorithm. 

The family of algorithms on words has features which make it a specific field 
within algorithmics. Indeed, algorithms on words are often of low complexity 
but intricate and difficult to prove. Many algorithms have even a linear time 
complexity corresponding to a single pass scanning of the input word. This con- 
trasts with the fact that correctness proofs of these algorithms are frequently 
complex. A well-known example of this situation is the Knuth-Morris-Pratt
string searching algorithm (see Section 1.2.3). This algorithm is compact, and
apparently simple, but the correctness proof requires a sophisticated loop in-
variant.

The field of algorithms on words still has challenging open problems. One 
of them is the minimal complexity of the computation of a longest common 
subword of two words which is still unknown. We present in Section 1.2.4 the 
classic quadratic dynamic programming algorithm. A more efficient algorithm 
is mentioned in the Notes. 

The field of algorithms on words is intimately related to formal models of 
computation. Among those models, finite automata and context-free grammars
are the most used in practice. This is why we devote a section (Section 1.3) to
finite automata and another one to grammars and syntax analysis (Section 1.6).
These models, and especially finite automata, regular expressions and transduc-
ers, are ubiquitous in the applications. They appear in almost all chapters.

The relationship between words and probability theory is an old one. Indeed, 
one of the basic aspects of probability and statistics is the study of sequences 
of events. In the elementary case of a finite sample space, like in coin tossing, 
the sequence of outcomes is a word. More generally, a partition of an arbitrary 
probability space into a finite number of classes produces sequences over a fi- 
nite set. Section 1.8 is devoted to an introduction to these aspects. They are 
developed later in Chapters 6 and 7. 


We have chosen to present the algorithms and the related properties in a 
direct style. This means that there are no formal statements of theorems, and 
consequently no formal proofs. Nevertheless, we give precise assertions and
enough arguments to show the correctness of algorithms and to evaluate their 
complexity. In some cases, we use results without proof and we give biblio- 
graphic indications in the Notes. 

For the description of algorithms, we use a kind of programming language 
close to usual programming languages. It gives more flexibility and improves 
the readability to do so instead of relying on a precise programming language. 

The syntactic features of our programs make them similar to a language like
Pascal, concerning the control structure and the elementary instructions. We 
take some liberty with real programs. In particular, we often omit declarations 
and initializations of variables. The parameter handling is C-like (no call by 
reference). Besides arrays, we use implicitly data structures like sets and stacks 
and pairs or triples of variables to simplify notation. All functions are global, and 
there is nothing like classes or other features of object-oriented programming. 
However, we use overloading for parsimony. The functions are referenced in 
the text and in the index by their name, like LONGESTCOMMONPREFIX() for 
example. 


1.1. Words 


We briefly introduce the basic terminology on words. Let A be a finite set 
usually called the alphabet. In practice, the elements of the alphabet may be 
characters from some concrete alphabet, but also more complex objects. They 
may be themselves words on another alphabet, as in the case of syllables in nat- 
ural language processing, as presented in Chapter 3. In information processing, 
any kind of record can be viewed as a symbol in some huge alphabet. As a
consequence, some apparently elementary operations on symbols, like
the test for equality, often need a careful definition and may require a delicate
implementation.

We denote as usual by A* the set of words over A and by ε the empty
word. For a word w, we denote by |w| the length of w. We use the notation
A⁺ = A* − {ε}. The set A* is a monoid. Indeed, the concatenation of words
is associative, and the empty word is a neutral element for concatenation. The
set A⁺ is sometimes called the free semigroup over A, while A* is called the free
monoid.

A word w is called a factor (resp. a prefix, resp. a suffix) of a word u if there
exist words x, y such that u = xwy (resp. u = wy, resp. u = xw). The factor
(resp. the prefix, resp. the suffix) is proper if xy ≠ ε (resp. y ≠ ε, resp. x ≠ ε).
The prefix of length k of a word w is also denoted by w[0..k − 1].


Figure 1.1. The tree of the free monoid on two letters.


The set of words over a finite alphabet A can be conveniently seen as a tree.
Figure 1.1 represents {a,b}* as a binary tree. The vertices are the elements of
A*. The root is the empty word ε. The sons of a node x are the words xa for
a ∈ A. Every word x can also be viewed as the path leading from the root
to the node x. A word x is a prefix of a word y if it is an ancestor of y in the tree.
Given two words x and y, the longest common prefix of x and y is the nearest
common ancestor of x and y in the tree.

A word x is a subword of a word y if there are words u₁, ..., uₙ and v₀, v₁,
..., vₙ such that x = u₁···uₙ and y = v₀u₁v₁···uₙvₙ. Thus, x is obtained
from y by erasing some factors in y.

Given two words x and y, a longest common subword is a word z of maximal
length that is a subword of both x and y. There may exist several longest
common subwords for two words x and y. For instance, the words abc and acb
have the longest common subwords ab and ac.

We denote by alph w the set of letters having at least one occurrence in the 
word w. 

The set of factors of a word x is denoted F(x). We denote by F(X) the set of
factors of words in a set X ⊆ A*. The reversal of a word w = a₁a₂···aₙ, where
a₁, ..., aₙ are letters, is the word w̃ = aₙaₙ₋₁···a₁. Similarly, for X ⊆ A*, we
denote X̃ = {x̃ | x ∈ X}. A palindrome word is a word w such that w = w̃.
If |w| is even, then w is a palindrome if and only if w = xx̃ for some word x.
Otherwise w is a palindrome if and only if w = xax̃ for some word x and some
letter a.

An integer p ≥ 1 is a period of a word w = a₁a₂···aₙ where aᵢ ∈ A if
aᵢ = aᵢ₊ₚ for i = 1, ..., n − p. The smallest period of w is called the period or the
minimal period of w.

A word w ∈ A⁺ is primitive if w = uⁿ for u ∈ A⁺ implies n = 1.

Two words x, y are conjugate if there exist words u, v such that x = uv and
y = vu. Thus conjugate words are just cyclic shifts of one another. Conjugacy
is thus an equivalence relation. The conjugacy class of a word of length n 
and period p has p elements if p divides n and has n elements otherwise. In 
particular, a primitive word of length n has n distinct conjugates. 
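
As an illustration only, the notions of period, primitivity and conjugacy can be checked by brute force directly from the definitions. The sketch below is written in Python rather than in the notation of this book, and the function names are our own assumptions; it makes no attempt at efficiency.

```python
def smallest_period(w):
    """Smallest p >= 1 such that w[i] == w[i + p] whenever both indices exist."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    raise ValueError("the empty word has no period")


def is_primitive(w):
    """A non-empty word is a proper power exactly when its period properly divides its length."""
    n = len(w)
    p = smallest_period(w)
    return not (p < n and n % p == 0)


def conjugacy_class(w):
    """Set of all cyclic shifts of w."""
    return {w[k:] + w[:k] for k in range(len(w))}
```

For instance, abab has period 2, which divides its length, and its conjugacy class has the two elements abab and baba, whereas aba has period 2 and a conjugacy class of three elements.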


1.1.1. Ordering 


There are three order relations frequently used on words. We give the definition 
of each of them. 

The prefix order is the partial order defined by x ≤ y if x is a prefix of y.

Two other orders, the radix order and the lexicographic order are refinements 
of the prefix order which are defined for words over an ordered alphabet A. Both 
are total orders. 

The radix order is defined by x < y if |x| < |y|, or |x| = |y| and x = uax′ and
y = uby′ with a, b letters and a < b. If integers are represented in base k without
leading zeroes, then the radix order on their representations corresponds to the
natural ordering of the integers. If we allow leading zeroes, the same holds
provided the words have the same length (which always can be achieved by
padding).


Figure 1.2. The tree of integers in binary notation.

For k = 2, the tree of words without leading zeroes is given in Figure 1.2. 
The radix order corresponds to the order in which the vertices are met in a 
breadth-first traversal. The index of a word in the radix order is equal to the 
number represented by the word in base 2. 

The lexicographic order, also called alphabetic order, is defined as follows. 
Given two words x, y, we have x < y if x is a proper prefix of y or if there exist
factorizations x = uax′ and y = uby′ with a, b letters and a < b. This is the
usual order in a dictionary. Note that x < y in the radix order if |x| < |y| or if
|x| = |y| and x < y in the lexicographic order.
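
The last remark gives a direct way to experiment with both orders. The following Python sketch is an illustration, not part of the algorithms of this book; it assumes that the alphabet is ordered by character code, which is how Python compares strings, and the name `radix_key` is ours.

```python
def radix_key(w):
    """Sort key for the radix order: compare lengths first, then lexicographically."""
    return (len(w), w)


words = ["b", "ab", "a", "ba", "aab"]

# Python compares strings lexicographically, a proper prefix being smaller.
lexicographic = sorted(words)

# The radix order lists shorter words first.
radix = sorted(words, key=radix_key)
```

Here the lexicographic order gives a, aab, ab, b, ba, while the radix order gives a, b, ab, ba, aab.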


1.1.2. Distances 


A distance over a set E is a function d which assigns to each pair of elements of E a
nonnegative number such that:


(i) d(u,v) = d(v,u),
(ii) d(u,w) ≤ d(u,v) + d(v,w) (triangular inequality),
(iii) d(u,v) = 0 iff u = v.


Several distances between words are used. The most common is the Hamming
distance. It is only defined on words of equal length. For two words u =
a₀···aₙ₋₁ and v = b₀···bₙ₋₁, where aᵢ, bᵢ are letters, it is the number d_H(u,v)
of indices i with 0 ≤ i ≤ n − 1 such that aᵢ ≠ bᵢ. In other terms


d_H(u,v) = Card{ i | 0 ≤ i ≤ n − 1 and aᵢ ≠ bᵢ }.


Thus the Hamming distance is the number of mismatches between u and v. It
can be verified that d_H is indeed a distance. Observe that d_H(u,v) = n − p
where p is the number of positions where u and v coincide. In a more general
setting, a distance between letters is used instead of just counting 1 for each
mismatch.
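
A minimal Python sketch of this definition follows; the function name is our own, and rejecting words of different lengths with an exception is a choice made for the illustration.

```python
def hamming_distance(u, v):
    """Number of positions at which two words of equal length differ."""
    if len(u) != len(v):
        raise ValueError("the Hamming distance is only defined for words of equal length")
    return sum(a != b for a, b in zip(u, v))
```

For example, the words abaababa and abbabaab differ at three positions.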

The Hamming distance takes into account the differences at the same posi-
tion. In this way, it can be used as a measure of modifications or errors caused
by the substitution of one symbol for another, but not by a deletion or an inser-
tion. Another distance is the subword distance which is defined as follows. Let
u be a word of length n and v be a word of length m, and p be the length of a
longest common subword of u and v. The subword distance between u and v is
defined as d_S(u,v) = n + m − 2p. It can be verified that d_S(u,v) is the minimal
number of insertions and deletions that change u into v. The name indel
(for insertions and deletions) is used to qualify the transformation consisting in
either an insertion or a deletion.


abaababa                 abaababa
abbabaab                 abbabaab
(a) Hamming distance     (b) Subword distance


Figure 1.3. The Hamming distance is 3 and the subword distance is 2.
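
To experiment with the subword distance, one may compute the length p by the standard quadratic dynamic programming scheme for longest common subwords. The Python sketch below is only an illustration under that assumption; the function names are ours, and only two rows of the table are kept in memory.

```python
def lcs_length(u, v):
    """Length of a longest common subword of u and v (quadratic dynamic programming)."""
    prev = [0] * (len(v) + 1)  # row for the prefix of u read so far
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def subword_distance(u, v):
    """d_S(u, v) = n + m - 2p, with p the length of a longest common subword."""
    return len(u) + len(v) - 2 * lcs_length(u, v)
```

For the words abc and acb, a longest common subword has length 2, so their subword distance is 2: one deletion and one insertion.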


Version June 23, 2004 




A common generalization of the Hamming distance and the subword distance 
is the edit distance. It takes into account the substitutions of a symbol by
another in addition to indels (see Problem 1.1.2).

A related distance is the prefix distance. It is defined as d(u,v) = n + m − 2p
where n = |u|, m = |v| and p is the length of the longest common prefix of u 
and v. It can be verified that the prefix distance is actually the length of the 
shortest path from u to v in the tree of the free monoid. 
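
As a small illustration in Python (the function name is our own), the prefix distance follows directly from the length of the longest common prefix:

```python
def prefix_distance(u, v):
    """d(u, v) = |u| + |v| - 2p, with p the length of the longest common prefix."""
    p = 0
    while p < len(u) and p < len(v) and u[p] == v[p]:
        p += 1
    # Going from u to v in the tree: climb |u| - p edges, then descend |v| - p edges.
    return len(u) + len(v) - 2 * p
```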


abbabaabbaababbabaababbaabbabaab
baababbaabbabaababbabaabbaababba


Figure 1.4. The Hamming distance of these two Thue-Morse blocks of
length 32 is equal to their length, their subword distance is only 6.


1.2. Elementary algorithms 


In this section, we treat algorithmic problems related to the basic notions on 
words: prefixes, suffixes, factors. 


1.2.1. Prefixes and suffixes 


Recall that a word x is a prefix of a word y if there is a word u such that
y = xu. It is said to be proper if u is non empty. Checking whether x is a prefix
of y is straightforward. Algorithm LONGESTCOMMONPREFIX below returns the
length of the longest common prefix of two words x and y.


LONGESTCOMMONPREFIX(x, y)
    ▷ x has length m, y has length n
    i ← 0
    while i < m and i < n and x[i] = y[i] do
        i ← i + 1
    return i


In the tree of a free monoid, the length of the longest common prefix of two 
words is the height of the least common ancestor. 

As mentioned earlier, the conceptual simplicity of the above algorithm hides 
implementation details such as the computation of equality between letters. 
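
A possible transcription of LONGESTCOMMONPREFIX in Python is sketched below. To reflect the remark on equality between letters, the comparison is passed as a parameter; this parameter, called `same` here, is our own addition and not part of the algorithm above.

```python
import operator


def longest_common_prefix(x, y, same=operator.eq):
    """Length of the longest common prefix of x and y.

    `same` decides when two letters are considered equal (plain equality by default).
    """
    i = 0
    while i < len(x) and i < len(y) and same(x[i], y[i]):
        i += 1
    return i
```

The words may be strings or lists of more complex symbols. For instance, passing `same=lambda a, b: a.lower() == b.lower()` compares letters without regard to case.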


1.2.2. Overlaps and borders 


We introduce first the notion of overlap of two words x and y. It captures the 
amount of possible overlap between the end of x and the beginning of y. To
avoid trivial cases, we rule out the case where the overlap would be the whole
word x or y. Formally, the overlap of x and y is the longest proper suffix of
x that is also a proper prefix of y. For example, the overlap of abacaba and
acabaca has length 5. The border of a non empty word w is the overlap of w
and itself. Thus it is the longest word u which is both a proper prefix and a
proper suffix of w. The overlap of x and y is denoted by overlap(x, y), and the
border of w by border(w). Thus border(w) = overlap(w, w).
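
These two definitions can be tested by brute force. The Python sketch below tries every candidate length, longest first; it is a quadratic reference implementation written for illustration, and the function names simply mirror the notation above.

```python
def overlap(x, y):
    """Longest proper suffix of x that is also a proper prefix of y (brute force)."""
    # "Proper" forces the candidate length k to satisfy k < len(x) and k < len(y).
    for k in range(min(len(x), len(y)) - 1, -1, -1):
        if x[len(x) - k:] == y[:k]:
            return y[:k]
    return ""  # convention chosen here (an assumption) when x or y is empty


def border(w):
    """Longest word that is both a proper prefix and a proper suffix of w."""
    return overlap(w, w)
```

On the example above, the overlap of abacaba and acabaca is acaba.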

As we shall see, the computation of the overlap of x and y is intimately
related to the computation of the border. This is due to the fact that the
overlap of x and y involves the computation of the overlaps of the prefixes of x
and y. Actually, one has overlap(xa, y) = border(xa) whenever x is a prefix of
y and a is a letter. Next, the following formula allows the computation of the
overlap of xa and y, where x, y are words and a is a letter. Let z = overlap(x, y).


overlap(xa, y) =  za            if za is a prefix of y,
                  border(za)    otherwise.


Observe that border(za) = overlap(za, y) because z is a prefix of y. The com-
putation of the border is an interesting example of a non trivial algorithm on
words. A naive algorithm would consist in checking for each prefix of w whether
it is also a suffix of w and to keep the longest such prefix. This would obviously
require a time proportional to |w|². We will see that it can be done in time
proportional to the length of the word. This relies on the following recursive
formula allowing to compute the border of xa in terms of the border of x, where
x is a word and a is a letter. Let u = border(x) be the border of x. Then for
each letter a,


border(xa) =  ua            if ua is a prefix of x,          (1.2.1)
              border(ua)    otherwise.

The following algorithm (Algorithm BORDER) computes the length of the
border of a word x of length m. It outputs an array b of m + 1 integers such that
b[j] is the length of the border of x[0..j − 1]. In particular, the length of border(x)
is b[m]. It is convenient to set b[0] = −1. For example, if x = abaababa, the
array b is


  j      0   1   2   3   4   5   6   7   8
  b[j]  −1   0   0   1   1   2   3   2   3


BORDER(x)
 1  ▷ x has length m, b has size m + 1
 2  i ← 0
 3  b[0] ← −1
 4  for j ← 1 to m − 1 do
 5      b[j] ← i
 6      ▷ Here x[0..i − 1] = border(x[0..j − 1])
 7      while i ≥ 0 and x[j] ≠ x[i] do
 8          i ← b[i]
 9      i ← i + 1
10  b[m] ← i
11  return b


This algorithm is an implementation of Formula (1.2.1). Indeed, the body
of the loop on j computes, in the variable i, the length of the border of x[0..j].
This value will be assigned to b[j] at the next increase of j. The inner loop is a
translation of the recursive formula.

The algorithm computes the border of x (or the table b itself) in time O(|x|).
Indeed, the execution time is proportional to the number of comparisons of
symbols performed at line 7. Each time a comparison is done, the expression
2j − i increases strictly. In fact, either x[j] = x[i], and i, j both increase by 1,
or x[j] ≠ x[i], and j remains constant while i decreases strictly (since b[i] < i).
Since the value of the expression is initially 0 and is bounded by 2|x|, the number
of comparisons is at most 2|x|.
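
For readers who wish to run the algorithm, here is one possible transcription of BORDER in Python. It follows the pseudocode line by line; the function name and the guard for the empty word are our own additions.

```python
def border_table(x):
    """Array b of size len(x) + 1 with b[0] = -1 and, for j >= 1,
    b[j] = length of the border of x[0..j-1]."""
    m = len(x)
    b = [0] * (m + 1)
    b[0] = -1
    i = 0
    for j in range(1, m):
        b[j] = i
        # Here x[0..i-1] is the border of x[0..j-1].
        while i >= 0 and x[j] != x[i]:
            i = b[i]  # fall back to the border of the current border
        i += 1
    if m > 0:  # guard added here so that b[0] stays -1 for the empty word
        b[m] = i
    return b
```

On x = abaababa this returns the array −1, 0, 0, 1, 1, 2, 3, 2, 3 given above.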

The computation of the overlap of two words x,y will be done in the next 
section. 


1.2.3. Factors 


In this section, we consider the problem of checking whether a word x is a
factor of a word y. This problem is usually referred to as a string matching
problem. The word x is called the pattern and y is the text. A more general
problem, referred to as pattern matching, occurs when x is replaced by a regular
expression X (see Section 1.4). The evaluation of the efficiency of string matching
or pattern matching algorithms depends on which parameters are considered. In
particular, one may consider the pattern to be fixed (because several occurrences
of the same pattern are looked for in an unknown text), or the text to be fixed
(because several different patterns will be searched in this text). When the
pattern or the text is fixed, it may be subject to a preprocessing. Moreover, the
evaluation of the complexity can take into account either only the computation
time, or both time and space. This may make a significant difference on very
large texts and patterns.

We begin with a naive quadratic string searching algorithm. To check whether
a word x is a factor of a word y, it is clearly enough to test for each index
j = 0, ..., n − 1 if x is a prefix of the word y[j..n − 1].


NAIVESTRINGMATCHING(x, y)
 1  ▷ x has length m, y has length n
 2  (i, j) ← (0, 0)
 3  while i < m and j < n do
 4      if x[i] = y[j] then
 5          (i, j) ← (i + 1, j + 1)
 6      else j ← j − i + 1
 7           i ← 0
 8  return i = m


The number of comparisons required in the worst case is O(|x||y|). The
worst case is reached for x = a^m b and y = a^n. The number of comparisons
performed is in this case m(n − m − 1).
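
The naive algorithm can be sketched in Python as follows. This is a direct transcription of the pseudocode under the assumption that words are Python strings; the function name is illustrative.

```python
def naive_string_matching(x, y):
    """Return True if x is a factor of y (quadratic in the worst case)."""
    m, n = len(x), len(y)
    i = j = 0
    while i < m and j < n:
        if x[i] == y[j]:
            i, j = i + 1, j + 1
        else:
            # Mismatch: shift the pattern one place to the right and restart.
            j = j - i + 1
            i = 0
    return i == m


print(naive_string_matching("aba", "aababa"))
print(naive_string_matching("abc", "abab"))
```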

We shall see now that it is possible to search a word x inside another word y in
linear time, that is in time O(|x| + |y|). The basic idea is to use a finite
automaton recognizing the words ending with x. If we can compute some
representation of it in time O(|x|), then it will be straightforward to process
the word y in time O(|y|).

The wonderfully simple solution presented below uses the notion of border
of a word. Suppose that we are in the process of identifying x inside y, the
position i in x being placed in front of position j in y, as in the naive
algorithm. We can then set x = ubt where b = x[i] and y = wuaz where a = y[j].
If a = b, the process goes on with i + 1, j + 1. Otherwise, instead of just
shifting x to the right one place (i.e. j = j − i + 1, i = 0), we can take into
account that the next possible position for x is determined by the border of u.
Indeed, we must have y = w'u'az and x = u'ct' with u' both a prefix of u and a
suffix of u since w'u' = wu. Hence the next comparison to perform is between
y[j] and x[k] where k is the length of the border of u.


Figure 1.5. Checking y[j] against x[i]: if they are different, y[j] is checked
against x[b[i]].


The algorithm is realized by the following program (Algorithm SEARCHFACTOR).
It returns true if x is a factor of y; in that case the first occurrence of x
inside y starts at position j − m. It uses an array b of |x| + 1 integers
such that b[i] is the length of the border of x[0..i − 1].


SEARCHFACTOR(x, y)
 1  ▷ x has length m, y has length n
 2  ▷ b is the array of length of borders of the prefixes of x
 3  b ← BORDER(x)
 4  (i, j) ← (0, 0)
 5  while i < m and j < n do
 6      while i ≥ 0 and x[i] ≠ y[j] do
 7          i ← b[i]
 8      (i, j) ← (i + 1, j + 1)
 9  return i = m


The time complexity is O(|x| + |y|). Indeed, the computation of the array
b can be done in time O(|x|) as in Section 1.2.2. Further, the analysis of the
algorithm given by the function SEARCHFACTOR is the same as for the function
BORDER. The expression 2j − i increases strictly at each comparison of two
letters, and thus the number of comparisons is bounded by 2|y|. Thus, the
complete time required to check whether x is a factor of y is O(|x| + |y|) as
announced.

Computing the overlap of two words x, y can be done as follows. We may
suppose |x| ≤ |y|. The value of overlap(x, y) is the final value of the variable i
in the algorithm SEARCHFACTOR applied to the pair (y, x).
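
The search and the overlap computation can be sketched together in Python. The sketch below is illustrative only: the names `run_search`, `find_factor` and `overlap` are not from the text, words are assumed to be Python strings, and `overlap` is taken here to mean the length of the longest suffix of x that is a prefix of y.

```python
def border_table(x):
    """b[j] = length of the border of x[0..j-1], with b[0] = -1."""
    m = len(x)
    if m == 0:
        return [-1]
    b = [0] * (m + 1)
    b[0] = -1
    i = 0
    for j in range(1, m):
        b[j] = i
        while i >= 0 and x[j] != x[i]:
            i = b[i]
        i += 1
    b[m] = i
    return b


def run_search(x, y):
    """Scan the text y with the pattern x; return the final values (i, j)."""
    b = border_table(x)
    m, n = len(x), len(y)
    i = j = 0
    while i < m and j < n:
        while i >= 0 and x[i] != y[j]:
            i = b[i]
        i, j = i + 1, j + 1
    return i, j


def find_factor(x, y):
    """Start of the first occurrence of x in y, or -1 (a convention chosen here)."""
    i, j = run_search(x, y)
    return j - len(x) if i == len(x) else -1


def overlap(x, y):
    """Assumes len(x) <= len(y): final i of the search of y inside x."""
    i, _ = run_search(y, x)
    return i


print(find_factor("aba", "aababa"), overlap("abcab", "cabxyz"))
```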


1.2.4. Subwords 


We now consider the problem of looking for subwords. The following algorithm 
checks whether x is a subword of y. In contrast to the case of factors, a greedy 
algorithm suffices to perform the check in linear time. 


ISSUBWORD(x, y)
 1  ▷ x has length m, y has length n
 2  (i, j) ← (0, 0)
 3  while i < m and j < n do
 4      if x[i] = y[j] then
 5          i ← i + 1
 6      j ← j + 1
 7  return i = m
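
A short Python sketch of this greedy check, assuming words are Python strings (the function name is illustrative):

```python
def is_subword(x, y):
    """Return True if x is a subword (subsequence) of y."""
    m, n = len(x), len(y)
    i = j = 0
    while i < m and j < n:
        if x[i] == y[j]:
            i += 1      # the letter x[i] is matched at the earliest possible position
        j += 1
    return i == m


print(is_subword("abd", "aXbXcXd"))
print(is_subword("ba", "ab"))
```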


We denote by lcs(x, y) the set of longest common subwords (also called
longest common subsequences) of two words x and y. The computation of
the longest common subwords is a classical algorithm with many practical uses.
We present below a quadratic algorithm. It is based on the following formula.

\[
\mathrm{lcs}(xa, yb) =
\begin{cases}
\mathrm{lcs}(x, y)\,a & \text{if } a = b,\\
\max(\mathrm{lcs}(xa, y), \mathrm{lcs}(x, yb)) & \text{otherwise,}
\end{cases}
\]

where max() stands for the union of the sets if their elements have equal length,
and for the set with the longer words otherwise.

In practice, one computes the length of the words in lcs(x, y). For this, define
an array M[i, j] by M[i, j] = k if the longest common subwords of the prefixes
of length i of x and j of y have length k. The previous formula then translates,
with a = x[i] and b = y[j], into

\[
M[i+1, j+1] =
\begin{cases}
M[i, j] + 1 & \text{if } a = b,\\
\max(M[i+1, j], M[i, j+1]) & \text{otherwise.}
\end{cases}
\]


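For instance, if x = abba and y = abab, the array M is the following (rows are
indexed by the prefixes of x, columns by the prefixes of y).

             a   b   a   b
         0   0   0   0   0
     a   0   1   1   1   1
     b   0   1   2   2   2
     b   0   1   2   2   3
     a   0   1   2   3   3
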

The first row and the first column of the array M are initialized at 0. The 
following function computes the array M. 


LCSLENGTHARRAY(x, y)
 1  ▷ x has length m and y has length n
 2  for i ← 0 to m − 1 do
 3      for j ← 0 to n − 1 do
 4          if x[i] = y[j] then
 5              M[i + 1, j + 1] ← M[i, j] + 1
 6          else M[i + 1, j + 1] ← max(M[i + 1, j], M[i, j + 1])
 7  return M


The above algorithm has quadratic time and space complexity. Observe that
the length of the longest common subwords, namely the value M[m, n], can be
computed in linear space (but quadratic time) by computing the matrix M row
by row or column by column. To recover a word in lcs(x, y), it is enough to
walk backwards through the array M.


LCS(x, y)
 1  ▷ result is a longest common word w
 2  M ← LCSLENGTHARRAY(x, y)
 3  (i, j, k) ← (m − 1, n − 1, M[m, n] − 1)
 4  while k ≥ 0 do
 5      if x[i] = y[j] then
 6          w[k] ← x[i]
 7          (i, j, k) ← (i − 1, j − 1, k − 1)
 8      else if M[i + 1, j] < M[i, j + 1] then
 9          i ← i − 1
10      else j ← j − 1
11  return w
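
The two functions above can be sketched in Python as follows. This is an illustrative transcription, assuming words are Python strings and the array M is stored as a list of lists; the function names are not from the text.

```python
def lcs_length_array(x, y):
    """M[i][j] = length of the longest common subwords of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column are 0
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                M[i + 1][j + 1] = M[i][j] + 1
            else:
                M[i + 1][j + 1] = max(M[i + 1][j], M[i][j + 1])
    return M


def lcs(x, y):
    """Return one longest common subword, walking backwards through M."""
    M = lcs_length_array(x, y)
    i, j = len(x) - 1, len(y) - 1
    k = M[len(x)][len(y)] - 1
    w = [""] * (k + 1)
    while k >= 0:
        if x[i] == y[j]:
            w[k] = x[i]
            i, j, k = i - 1, j - 1, k - 1
        elif M[i + 1][j] < M[i][j + 1]:
            i -= 1
        else:
            j -= 1
    return "".join(w)


print(lcs("abba", "abab"))
```

Only one element of lcs(x, y) is returned; which one depends on the tie-breaking rule in the backward walk.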




1.2.5. Conjugacy and Lyndon words 


Two words x, y are said to be conjugate if x = uv,y = vu, for some words u, v. 
Thus two words are conjugate if they differ only by a cyclic permutation of their 
letters. 

To check whether x and y are conjugate, we can compare all possible cyclic
permutations of x with y. This requires O(|x||y|) operations. Actually we can do
much better as follows. Indeed, x and y are conjugate if and only if |x| = |y|
and x is a factor of yy. To see this, suppose that |x| = |y| and yy = uxv. We
have |y| ≤ |ux|, |xv| and thus there are words u', v' such that x = v'u' and
y = uv' = u'v. Since |x| = |y|, we have |u'| = |u|, whence u = u' and v = v'.
This shows that x = vu, y = uv.

Hence, using the linear time algorithm SEARCHFACTOR of Section 1.2.3, we
can check in time O(|x| + |y|) whether two words x, y are conjugate.
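
The criterion can be sketched in one line of Python. Note the assumption: the sketch relies on Python's built-in substring test `in` rather than on SEARCHFACTOR, so it illustrates the criterion and makes no claim about the linear time bound.

```python
def are_conjugate(x, y):
    """x and y are conjugate iff |x| = |y| and x is a factor of yy."""
    return len(x) == len(y) and x in y + y


print(are_conjugate("abcd", "cdab"))
print(are_conjugate("aab", "abb"))
```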

Recall that a Lyndon word is a word which is strictly smaller than any of its
conjugates for the alphabetic ordering. In other terms, a word x is a Lyndon
word if for any factorization x = uv with u, v nonempty, one has uv < vu. A
Lyndon word is in particular primitive.


Figure 1.6. Checking whether x[k..k + m − 1] is the least circular conjugate
of x. (The figure marks the positions k, k + i, k + j − i and k + j.)


Any primitive word has a conjugate which is a Lyndon word, namely its
least conjugate. Computing the smallest conjugate of a word is a practical way
to compute a standard representative of the conjugacy class of a word (this is
sometimes called canonization). This can be done in linear time by the following
algorithm, which is a modification of the algorithm BORDER of Section 1.2.2.
It is applied to a word x of length m. We actually use an array containing xx,
and call this array x. Of course, an array of length m would suffice provided
the indices are computed mod m.


CIRCULARMIN(x)
 1  (i, j, k) ← (0, 1, 0)
 2  b[0] ← −1
 3  while k + j < 2m do
 4      ▷ Here x[k..k + i − 1] = border(x[k..k + j − 1])
 5      if j − i = m then
 6          return k
 7      b[j] ← i
 8      while i ≥ 0 and x[k + j] ≠ x[k + i] do
 9          if x[k + j] < x[k + i] then
10              (k, j) ← (k + j − i, i)
11          i ← b[i]
12      (i, j) ← (i + 1, j + 1)


Algorithm CIRCULARMIN looks like Algorithm BORDER. Indeed, if we discard
lines 5-6 and lines 9-10 in Algorithm CIRCULARMIN, the variable k remains
0 and we obtain an essentially equivalent algorithm (with a while loop replacing
the for loop). The key assertion of this algorithm is that x[k..k + i − 1] =
border(x[k..k + j − 1]), as indicated at line 4. This is the same as the assertion
in Algorithm BORDER for k = 0. The array b contains the information on borders,
in the sense that b[j] is the length of border(x[k..k + j − 1]).

The value of k is the index of the beginning of a candidate for a least
conjugate of x (see Figure 1.6). If the condition at line 9 holds, a new candidate
has been found. The assignment at line 10 shifts the value of k by j − i, and j
is adjusted in such a way that the value of k + j is not modified. The
modification of the value of k does not require the entire re-computation of the
array b. Indeed, the values b[j'] for 0 ≤ j' ≤ i serve both for the old and the
new candidate. For the same reason as for Algorithm BORDER, the time complexity
is linear in the size of x.
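
A Python sketch of Algorithm CIRCULARMIN is given below, under some assumptions that are not in the text: words are Python strings, the doubled word xx is built explicitly, and the final `return k` after the loop (reached for instance when x is not primitive) is a convention added here.

```python
def circular_min(x):
    """Return an index k such that x[k:] + x[:k] is the least conjugate of x."""
    m = len(x)
    xx = x + x                      # the array containing xx
    b = [0] * (2 * m + 1)
    b[0] = -1
    i, j, k = 0, 1, 0
    while k + j < 2 * m:
        # Here xx[k..k+i-1] is the border of xx[k..k+j-1].
        if j - i == m:
            return k
        b[j] = i
        while i >= 0 and xx[k + j] != xx[k + i]:
            if xx[k + j] < xx[k + i]:
                # A smaller candidate starts at k + j - i; k + j is unchanged.
                k, j = k + j - i, i
            i = b[i]
        i, j = i + 1, j + 1
    return k                        # assumption: fallback when the loop ends


for word in ["ba", "aab", "bab", "abaab"]:
    k = circular_min(word)
    print(word, k, word[k:] + word[:k])
```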

Any word admits a unique factorization as a nonincreasing product of Lyndon
words. In other words, for any word x, there is a factorization

\[
x = \ell_1^{n_1} \cdots \ell_r^{n_r}
\]

where r ≥ 0, $n_1, \ldots, n_r \geq 1$, and $\ell_1 > \cdots > \ell_r$ are Lyndon
words. We discuss now an algorithm to compute this factorization.

The following program computes the pair ($\ell_1$, $n_1$) for x in linear time
(the word $\ell_1$ is given by its length). By iteration, it allows one to compute
the Lyndon factorization in linear time.


LYNDONFACTORIZATION(x)
 1  ▷ x has length m
 2  (i, j) ← (0, 1)
 3  while j < m and x[i] ≤ x[j] do
 4      if x[i] < x[j] then
 5          i ← 0
 6      else i ← i + 1
 7      j ← j + 1
 8  return (j − i, ⌊j/(j − i)⌋)


The idea of the algorithm is the following. Assume that at some step, we
have $x = \ell^n p y$, where $\ell$ is a Lyndon word, n ≥ 1 and p is a proper
prefix of $\ell$. The pair ($\ell$, n) is a candidate for the value
($\ell_1$, $n_1$) of the factorization. The relation with the values i, j of the
program is given by $j = |\ell^n p|$, $j - i = |\ell|$, n = ⌊j/(j − i)⌋. Let
a = x[i], b = x[j]. Then $\ell = paq$ for some word q, and $x = \ell^n pbz$. If
a < b, then $\ell' = \ell^n pb$ is a Lyndon word. The pair ($\ell'$, 1) becomes
the new candidate. If a = b, then pb replaces p. Finally, if a > b the pair
($\ell$, n) is the correct value of ($\ell_1$, $n_1$).
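
The program and its iteration can be sketched in Python. The sketch assumes words are Python strings; the start offset `s` and the function names are additions made here so that the iteration does not copy the remaining suffix.

```python
def first_lyndon_factor(x, s=0):
    """Return (|l1|, n1) for the suffix x[s:], following LYNDONFACTORIZATION."""
    m = len(x)
    i, j = s, s + 1
    while j < m and x[i] <= x[j]:
        if x[i] < x[j]:
            i = s           # the whole prefix read so far is a Lyndon word
        else:
            i += 1          # the current period continues
        j += 1
    length = j - i
    return length, (j - s) // length


def lyndon_factorization(x):
    """Return the list of Lyndon factors of x, in nonincreasing order."""
    factors = []
    s = 0
    while s < len(x):
        length, n = first_lyndon_factor(x, s)
        for _ in range(n):
            factors.append(x[s:s + length])
            s += length
    return factors


print(lyndon_factorization("abaab"))
```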

The above algorithm can also be used to compute the Lyndon word $\ell$ in the
conjugacy class of a primitive word x. Indeed, $\ell$ is the only Lyndon word of
length |x| that appears in the Lyndon factorization of xx. Thus, Algorithm
LYNDONFACTORIZATION gives an alternative to Algorithm CIRCULARMIN.


1.3. Tries and automata 


In this section, we consider sets of words. These sets arise in a natural way 
in applications. Dictionaries in natural language processing, or more generally 
text processing in data bases are typical examples. Another situation is when 
one considers properties of words, and the sets satisfying such a property, for 
example the set of all words containing a given pattern. We are interested in 
the practical representation for retrieval and manipulation of sets of words. 

The simplest case is the case of finite, but possibly very large sets. General 
methods for manipulation of sets may be used. This includes hash functions, 
bit vectors, and various families of search trees. These general methods are 
sometimes available in programming packages. We will be interested here in 
methods that apply specifically to sets of words. 

Infinite sets arise naturally in pattern matching. The natural way to han- 
dle them is by means of two equivalent notions: regular expressions and finite 
automata. We describe here in some detail these approaches. 


1.3.1. Tries 


A trie is the simplest nontrivial structure allowing one to represent a finite
set X of words. It has the advantage both of reducing the space required to store
the set of words and of allowing fast access to each element.

A trie R is a rooted tree. Each edge is labelled with a letter. The labels have
the property that two distinct edges starting in the same vertex have distinct
labels. A subset T of the set of vertices is called the set of terminal vertices.
The set X of words represented by the trie R is the set of labels of paths from
the root to a vertex in T. It is convenient to assume that every vertex is on
a path from the root to some vertex in T (since otherwise the vertex could be
removed). In particular, every leaf of the tree is a terminal vertex.


EXAMPLE 1.3.1. The trie represented in Figure 1.7 represents the set
X = {leader, let, letter, sent}.


The terminal vertices are doubly circled.


Figure 1.7. A trie


To implement a trie, we use a function NEXT(p,a) which gives the vertex q 
such that the edge (p,q) is labeled a. We assume that NEXT(p,a)= —1 if the 
function is not defined. The root of the tree is the value of ROOT(). 


ISINTRIE(w)
 1  ▷ checks if the word w of length n is in the trie
 2  (i, p) ← LONGESTPREFIXINTRIE(w, 0)
 3  return i = n and p is a terminal vertex


Function ISINTRIE returns true if the word w is in the set represented by
the trie. It uses the function LONGESTPREFIXINTRIE() to compute the pair
(i, p) where i is the length of the longest prefix of w which is the label of a
path in the trie, and p is the vertex reached by this prefix. For future use, we
give a slightly more general version of this function. It computes the pair
(i, p) where i is the length of the longest prefix of the suffix of w starting in
position j.


LONGESTPREFIXINTRIE(w, j)
 1  ▷ returns the length of the longest prefix of w[j..n − 1]
 2  ▷ in the trie, and the vertex reached by this prefix.
 3  q ← ROOT()
 4  for i ← j to n − 1 do
 5      p ← q
 6      q ← NEXT(q, w[i])
 7      if q is undefined then
 8          return (i − j, p)
 9  return (n − j, q)


Searching a word in a trie is done in linear time with respect to the length of
the word. It does not depend on the size of the trie. This is the main advantage
of this data structure. However, this is only true under the assumption that the
function NEXT can be computed in constant time. In practice, if the alphabet
is of large size, this might no longer be true.

To add a word to a trie amounts to the following simple function. 


ADDTOTRIE(w)
 1  ▷ adds the word w to the trie.
 2  (i, p) ← LONGESTPREFIXINTRIE(w, 0)
 3  for j ← i to n − 1 do
 4      q ← NEWVERTEX()
 5      NEXT(p, w[j]) ← q
 6      p ← q
 7  Add p to the set of terminal vertices


We use a function NEWVERTEX() to create a new vertex of the trie. The 
function ADDTOTRIE() is linear in the length of w, provided NEXT() is in 
constant time. 

Removing a word from a trie is easy if we have a function FATHER() giving
the father of a vertex. The function can be tabulated during the construction
of the trie (by adding the instruction FATHER(q) ← p just after line 5
in Algorithm ADDTOTRIE). The function FATHER() can also be computed
on the fly during the computation of LONGESTPREFIXINTRIE() at line 2 of
Algorithm REMOVEFROMTRIE(). Another possibility, avoiding the use of the
function FATHER(), is to write the function REMOVEFROMTRIE() recursively.
We also use a boolean function ISLEAF() to test whether a vertex is a leaf or
not.


Version June 23, 2004 




REMOVEFROMTRIE(w)
 1  ▷ removes the word w of length n from the trie
 2  (i, p) ← LONGESTPREFIXINTRIE(w, 0)
 3  ▷ i should be equal to n
 4  Remove p from the set of terminal vertices
 5  while ISLEAF(p) and p is not terminal do
 6      (i, p) ← (i − 1, FATHER(p))
 7      NEXT(p, w[i]) ← −1
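
The trie functions above can be gathered in a small Python class. This is a sketch under assumptions made here, not the implementation of the text: vertices are integers, NEXT is stored as one dictionary per vertex, FATHER is tabulated on insertion, the root is vertex 0, and the removal loop stops at the root.

```python
class Trie:
    def __init__(self):
        self.next = [{}]            # next[p][a] = q; vertex 0 is the root
        self.father = [None]
        self.terminal = set()

    def longest_prefix(self, w, j=0):
        """Length of the longest prefix of w[j:] in the trie, and vertex reached."""
        q = 0
        for i in range(j, len(w)):
            p = q
            q = self.next[q].get(w[i])
            if q is None:
                return i - j, p
        return len(w) - j, q

    def __contains__(self, w):
        i, p = self.longest_prefix(w)
        return i == len(w) and p in self.terminal

    def add(self, w):
        i, p = self.longest_prefix(w)
        for j in range(i, len(w)):
            q = len(self.next)      # plays the role of NEWVERTEX()
            self.next.append({})
            self.father.append(p)
            self.next[p][w[j]] = q
            p = q
        self.terminal.add(p)

    def remove(self, w):
        i, p = self.longest_prefix(w)
        if i != len(w) or p not in self.terminal:
            return                  # w is not in the set: nothing to do
        self.terminal.discard(p)
        # Prune vertices that became useless leaves (never the root).
        while p != 0 and not self.next[p] and p not in self.terminal:
            i, p = i - 1, self.father[p]
            del self.next[p][w[i]]


t = Trie()
for word in ["leader", "let", "letter", "sent"]:
    t.add(word)
print("let" in t, "le" in t, t.longest_prefix("lettuce")[0])
t.remove("letter")
print("letter" in t, "let" in t, t.longest_prefix("letter")[0])
```

Dictionaries give expected constant-time access to NEXT; removed vertices are simply left unused in the lists.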


The use of a trie structure reduces the space needed to represent a set of
words, compared with a naive representation. If one wishes to further reduce
the size, it is possible to use an acyclic graph instead of a tree. The result is an
acyclic graph with labeled edges, an initial vertex and a set of terminal vertices.
This is sometimes called a directed acyclic word graph, abbreviated as DAWG.


EXAMPLE 1.3.2. We represent below a DAWG for the set 
X = {leader, let, letter, sent} 


of Example 1.3.1. 


Figure 1.8. A directed acyclic word graph (DAWG). 


For a given finite set X of words, there is a unique minimal DAWG
representing X. This is a particular case of a statement concerning finite
automata that we shall see in the next section. The minimal DAWG is actually the
minimal deterministic automaton recognizing X, and standard algorithms exist to
compute it.


1.3.2. Automata 


An automaton over an alphabet A is composed of a set Q of states, a finite set
E ⊆ Q × A* × Q of edges or transitions and two sets I, T ⊆ Q of initial and
terminal states. For an edge e = (p, w, q), the state p is the origin, w is the
label, and q is the end.

The automaton is often denoted 𝒜 = (Q, E, I, T), or also (Q, I, T) when E
is understood, or even 𝒜 = (Q, E) if Q = I = T.

A path in the automaton 𝒜 is a sequence

\[
(p_0, w_1, p_1), (p_1, w_2, p_2), \ldots, (p_{n-1}, w_n, p_n)
\]

of consecutive edges. Its label is the word $x = w_1 w_2 \cdots w_n$. The path
starts at $p_0$ and ends at $p_n$. The path is often denoted

\[
p_0 \xrightarrow{x} p_n
\]


A path is successful if it starts in an initial state and ends in a terminal state. 
The set recognized by the automaton is the set of labels of its successful paths. 

A set is recognizable or regular if it is the set of words recognized by some 
automaton. 

The family of regular sets is the simplest family of sets that admits
an algorithmic description. It is also the most widely used one, because of its
numerous closure properties.

A state p is accessible if there is a path starting in an initial state and ending 
in p. It is coaccessible if there is a path starting in p and ending in a terminal 
state. An automaton is trim if every state is accessible and coaccessible. 

An automaton is unambiguous if, for each pair of states p,q, and for each 
word w, there is at most one path from p to q labeled with w. An automaton 
is represented as a labelled graph. Initial (final) states are distinguished by an 
incoming (outgoing) arrow. 


EXAMPLE 1.3.3. The automaton given in Figure 1.9 recognizes the set of words 
on the alphabet {a,b} ending with aba. It is unambiguous and trim. 


Figure 1.9. A nondeterministic automaton. 


The definition of an automaton given above is actually an abstraction which
arose from circuits and sequential processes. In this context, an automaton
is frequently called a state diagram to mean that the states represent possible
values of time changing variables.

In some situations, this representation is not adequate. In particular, the
number of states can easily become too large. Indeed, the number of states
is in general exponential in the number of variables. A typical example is the
automaton which memorizes the n last input symbols. It has $2^n$ states on a
binary alphabet but can be represented simply with n binary variables. Observe
however that this situation is not general. In particular, automata occurring
in linguistics or in bioinformatics cannot in general be represented with such
parsimony.

We have introduced here a general model of automata which allows edges
labelled by words. This allows in particular edges labelled by the empty word.
Such an edge is usually called an ε-transition. We will use here two particular
cases of this general definition. The first is that of a synchronous automaton in
which all edges are labelled by letters. In this case, the length of a path equals
the length of its label.

An automaton which is not synchronous is called asynchronous. Among 
asynchronous automata, we use literal automata as a second class. These have 
labels that are either letters or the empty word. In this case, the length of a 
path is always at least equal to the length of its label. 


EXAMPLE 1.3.4. The automaton 𝒜 of Figure 1.10 is asynchronous but literal.
It recognizes the set a*b*.


Figure 1.10. A literal automaton for the set a*b* (two states 1 and 2, with
edge labels a and b).


An automaton is deterministic if it is synchronous, it has a unique initial
state, and if, for each state p and each letter a, there is at most one edge which
starts at p and is labeled by a. The end state of the edge is denoted by p · a.
Clearly, a deterministic automaton is unambiguous. For any word w, there is
at most one path starting in p and labeled w. The end state of this path is
denoted p · w. Clearly, for any state p and any words u, v, one has

\[
p \cdot uv = (p \cdot u) \cdot v
\]

provided the paths exist.

An automaton is complete if for any state p and any letter a there exists an
edge starting in p and labelled with a. Any automaton can be completed, that is
transformed into a complete automaton by adding one state (frequently called
the sink) and by adding transitions to this state whenever they do not exist in
the original automaton.


EXAMPLE 1.3.5. The automaton given in Figure 1.11 is deterministic. It
recognizes the set of words having no occurrence of the factor aa. It is
frequently called the Golden mean automaton, because the number of words of
length n it recognizes is the Fibonacci number $F_{n+2}$ (with the convention
$F_0 = 0$ and $F_1 = 1$).


Figure 1.11. The golden mean automaton.


An automaton is finite if its set of states is finite. Since the alphabet is 
usually assumed to be finite, this means that the set of edges is finite. 

A set of words X over A is recognizable if it can be recognized by a finite
automaton.

The implementation of a deterministic automaton with a finite set of states
Q, and over a finite alphabet A, uses the next-state function which is the partial
function NEXT(p, a) = p · a. In practice, the states are identified with integers,
and the next-state function is given either by an array or by a set of edges (a, q)
for each state p. The set may be either hashed, or listed, or represented in some
balanced tree structure. Other representations exist with the aim of reducing
the space while preserving the efficiency of the access.

The next-state function is extended to a function again called NEXT and
defined by NEXT(p, w) = p · w, for a word w. A practical implementation has to
choose a convenient way to represent the case where the function is undefined.


NEXT(p, w)
 1  ▷ w has length n
 2  for i ← 0 to n − 1 do
 3      p ← NEXT(p, w[i])
 4      if p is undefined then
 5          break
 6  return p
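
As an illustration, the next-state function can be stored in a Python dictionary keyed by (state, letter), with `None` standing for "undefined". The sketch below is an assumption-laden example: the transition table `GOLDEN_MEAN` (1 · a = 2, 1 · b = 1, 2 · b = 1, initial state 1, all states terminal) is the usual two-state golden mean automaton and is written out here by hand, and the function names are illustrative.

```python
from itertools import product

# Assumed transitions of the golden mean automaton; 2 . a is undefined.
GOLDEN_MEAN = {(1, "a"): 2, (1, "b"): 1, (2, "b"): 1}


def next_state(delta, p, w):
    """Extension of the next-state function to a word w; None if undefined."""
    for letter in w:
        p = delta.get((p, letter))
        if p is None:
            break
    return p


def count_recognized(delta, initial, n, alphabet="ab"):
    """Number of words of length n with a defined path (all states terminal)."""
    return sum(
        next_state(delta, initial, "".join(w)) is not None
        for w in product(alphabet, repeat=n)
    )


print(next_state(GOLDEN_MEAN, 1, "abab"), next_state(GOLDEN_MEAN, 1, "abaab"))
print([count_recognized(GOLDEN_MEAN, 1, n) for n in range(6)])
```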


EXAMPLE 1.3.6. For the Golden mean automaton, the next-state function is
represented by the following table (observe that 2 · a is undefined).

          a   b
      1   2   1
      2   -   1


For the implementation of nondeterministic automata, we restrict ourselves
to the case of a literal automaton which is the most frequent one. For each state,
the set of outgoing edges is represented by sets NEXT(p, a) for each letter a, and
NEXT(p, ε) for the ε-transitions. By definition NEXT(p, a) = {q | (p, a, q) ∈ E},
and NEXT(p, ε) = {q | (p, ε, q) ∈ E}, where E denotes the set of edges. We
denote by INITIAL the set of initial states, and by TERMINAL the set of terminal
states.

In order to check whether a word is accepted by a nondeterministic automaton,
one performs a search in the graph controlled by the word to be processed.
We treat this search in a breadth-first manner in the sense that, for each prefix
p of the word, we compute the set of states reachable by p.

For this, we start with the implementation of the next-state function for a
set of states. We give a function NEXT(S, a) that computes the set of states
reachable from a state in S by a path consisting of an edge labelled by the letter
a followed by a path labelled ε. Another possible choice consists in grouping the
ε-transitions before the edge labelled by a. This will be seen in the treatment
of the computation of a word.


NEXT(S, a)
 1  ▷ S is a set of states, and a is a letter
 2  T ← ∅
 3  for q ∈ S do
 4      T ← T ∪ NEXT(q, a)
 5  return CLOSURE(T)


The function CLOSURE(T) computes the set of states accessible from states
in T by paths labelled ε. This is just a search in a graph, and it can be
performed either depth-first or breadth-first. The time complexity of the function
NEXT(S, a) is O(d · Card(S)), where d is the maximal out-degree of a state.

The function NEXT() extends to words as follows. 


NEXT(S, w)
 1  ▷ S is a set of states, and w is a word of length n
 2  T ← CLOSURE(S)
 3  for i ← 0 to n − 1 do
 4      T ← NEXT(T, w[i])
 5  return T


In order to test whether a word w is accepted by an automaton, it suffices 
to compute the set S = NEXT(INITIAL, w), and to check whether S meets the 
set of final states. This is done by the following function. 


ISACCEPTED(w)
 1  ▷ S is a set of states
 2  S ← NEXT(INITIAL, w)
 3  return S ∩ TERMINAL ≠ ∅
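
The functions NEXT(S, a), CLOSURE, NEXT(S, w) and ISACCEPTED can be sketched together in Python. The representation is an assumption made for this sketch: `delta` maps (state, letter) to a set of states, `eps` maps a state to the set of targets of its ε-transitions. The sample automaton for a*b* (a loop a on state 1, a loop b on state 2, an ε-transition from 1 to 2, state 2 terminal) is a plausible reading of Example 1.3.4, written out here by hand.

```python
def closure(eps, states):
    """States accessible from `states` by paths labelled by the empty word."""
    seen = set(states)
    stack = list(states)
    while stack:                         # depth-first search on epsilon edges
        p = stack.pop()
        for q in eps.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def next_set(delta, eps, S, a):
    """One edge labelled a from a state of S, followed by epsilon edges."""
    T = set()
    for q in S:
        T |= delta.get((q, a), set())
    return closure(eps, T)


def is_accepted(delta, eps, initial, terminal, w):
    S = closure(eps, initial)
    for a in w:
        S = next_set(delta, eps, S, a)
    return bool(S & terminal)


# Assumed literal automaton for a*b*.
DELTA = {(1, "a"): {1}, (2, "b"): {2}}
EPS = {1: {2}}
print(is_accepted(DELTA, EPS, {1}, {2}, "aabb"))
print(is_accepted(DELTA, EPS, {1}, {2}, "ba"))
```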


The time complexity of the function ISACCEPTED(w) is O(nmd), where m is the
number of states and d is the maximal out-degree of a state. Thus, in all cases,
the time complexity is O(nm²).


1.3.3. Determinization algorithm


Instead of exploring a nondeterministic automaton, one may compute an
equivalent deterministic automaton and perform the acceptance test on the
resulting deterministic automaton. This preprocessing is especially interesting
when the same automaton is going to be used on several inputs. However, the size
of the deterministic automaton may be exponential in the size of the original,
nondeterministic one, and the direct search may be the only realistic option.

We now show how to compute an equivalent deterministic automaton. The
states of the deterministic automaton are sets of states, namely the sets
computed by the function NEXT(). A practical implementation of the algorithm
will use an appropriate data structure for a collection of sets of states. This can
be a linked list or an array of sets. We only need to be able to add elements,
and to test membership.

The function EXPLORE() consists essentially in searching, in the automaton
ℬ under construction, the states that are accessible. As for any exploration,
several strategies are possible. We use a depth-first search realized by recursive
calls of the function EXPLORE().


EXPLORE(T, S, ℬ)
 1  ▷ T is a collection of sets of states of 𝒜
 2  ▷ T is also the set of states of ℬ
 3  ▷ S is an element of T
 4  for each letter c do
 5      U ← NEXT_𝒜(S, c)
 6      NEXT_ℬ(S, c) ← U
 7      if U ≠ ∅ and U ∉ T then
 8          T ← T ∪ {U}
 9          (T, ℬ) ← EXPLORE(T, U, ℬ)
10  return (T, ℬ)


We can now write the determinization algorithm. 


NFATODFA(𝒜)
 1  ▷ 𝒜 is a nondeterministic automaton
 2  ℬ ← NEWDFA()
 3  I ← CLOSURE(INITIAL_𝒜)
 4  INITIAL_ℬ ← I
 5  ▷ T is a collection of sets of states of 𝒜
 6  T ← {I}
 7  (T, ℬ) ← EXPLORE(T, I, ℬ)
 8  TERMINAL_ℬ ← {U ∈ T | U ∩ TERMINAL_𝒜 ≠ ∅}
 9  return ℬ


The result of Algorithm NFATODFA is the deterministic automaton ℬ. Its
set of states is the set T. In practice, it can be represented by a set of integers
coding the elements of T, as the collection T itself is not needed any more. The
complexity of Algorithm NFATODFA is proportional to the size of the resulting
deterministic automaton times the complexity of testing membership in line 7.

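
The determinization can be sketched in Python with frozensets as states of the new automaton. This sketch departs from the pseudocode in two labelled ways: it uses an explicit stack instead of recursive calls of EXPLORE, and it records no transition when the target set is empty. The dictionaries `delta` and `eps` and the sample automaton for a*b* are assumptions made for the illustration.

```python
def closure(eps, states):
    """States accessible from `states` by epsilon-labelled paths."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for q in eps.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def next_set(delta, eps, S, a):
    T = set()
    for q in S:
        T |= delta.get((q, a), set())
    return closure(eps, T)


def nfa_to_dfa(alphabet, delta, eps, initial, terminal):
    """Return (states, transitions, initial state, terminal states) of the DFA."""
    start = frozenset(closure(eps, initial))
    states = {start}
    trans = {}
    stack = [start]
    while stack:
        S = stack.pop()
        for c in alphabet:
            U = frozenset(next_set(delta, eps, S, c))
            if not U:
                continue            # transition left undefined
            trans[(S, c)] = U
            if U not in states:
                states.add(U)
                stack.append(U)
    finals = {U for U in states if U & terminal}
    return states, trans, start, finals


# Assumed literal automaton for a*b*: loops a on 1, b on 2, epsilon edge 1 -> 2.
DELTA = {(1, "a"): {1}, (2, "b"): {2}}
EPS = {1: {2}}
states, trans, start, finals = nfa_to_dfa("ab", DELTA, EPS, {1}, {2})
print(sorted(sorted(S) for S in states))
```

Using a Python set of frozensets for T gives expected constant-time membership tests at the step corresponding to line 7.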
EXAMPLE 1.3.7. We show in a first example the computation of a deterministic
automaton equivalent to a nondeterministic one. We start with the automaton
𝒜 given in Figure 1.12. We have INITIAL_𝒜 = {1, 2} and TERMINAL_𝒜 = {1}.


Figure 1.12. The nondeterministic automaton 𝒜.


The next-state function is given by the following table 


The collection T of sets of states of the resulting automaton computed by
Algorithm NFATODFA is T = {{1, 2}, {1}}. The automaton is represented in
Figure 1.13. It is actually the Golden mean automaton of Example 1.3.5.


Figure 1.13. The deterministic version ℬ of 𝒜.


EXAMPLE 1.3.8. As a second example, we consider the automaton 𝒜 of
Example 1.3.4. We have INITIAL_𝒜 = {1}, and CLOSURE(INITIAL_𝒜) = {1, 2}. The
resulting deterministic automaton is given in Figure 1.14.


Figure 1.14. A deterministic automaton for the set a*b*.


EXAMPLE 1.3.9. For any set K of words, let F(K) denote the set of factors of
the words in K. We are going to verify a formula involving the shuffle of two
sets of words. Formally, the shuffle operator ⧢ is defined inductively on words
by u ⧢ ε = ε ⧢ u = u and

\[
ua ⧢ vb =
\begin{cases}
(u ⧢ v)\,a & \text{if } a = b,\\
(ua ⧢ v)\,b \cup (u ⧢ vb)\,a & \text{otherwise.}
\end{cases}
\]


The shuffle of two sets is the union of the shuffles of the words in the sets.
The formula is the following

\[
F((ab)^*) ⧢ F((ab)^*) = F((ab + ba)^*). \tag{1.3.1}
\]

This equality is the basis of a card trick known as Gilbreath's card trick (see
Notes).

In order to prove this formula, we apply a general principle that is valid for
regular sets. It consists in computing deterministic automata for each side of
the equation and checking that they are equivalent.

The set F((ab)*) is recognized by the automaton on the left of Figure 1.15.
It is easy to see that the set F((ab)*) ⧢ F((ab)*) is recognized by the
nondeterministic automaton on the right of Figure 1.15, realized by forming pairs
of states of the first automaton with action on either component.


(a) F((ab)*)    (b) F((ab)*) ⧢ F((ab)*)


Figure 1.15. Two automata, recognizing F((ab)*) and F((ab)*) ⧢ F((ab)*).


To compute a deterministic automaton, we first renumber the states as
indicated on the left of Figure 1.16. The result of the determinization is shown
on the right.

Next, an automaton recognizing (ab + ba)* is shown on the left of Figure 1.17.

To recognize the set F((ab + ba)*), we make all states initial and terminal
in this automaton. The determinization algorithm is then applied with the new
initial state {1, 2, 3}. The result is shown on the right of Figure 1.17. This
automaton is clearly equivalent to the automaton of Figure 1.16. This proves
Formula 1.3.1.


(a) After renaming states.    (b) Deterministic automaton.


Figure 1.16. On the right, a deterministic automaton recognizing the
set F((ab)*) ⧢ F((ab)*) which is recognized by the automaton on the left.


(a) An automaton recognizing the set F((ab + ba)*).
(b) A deterministic automaton for this set.


Figure 1.17. Two automata recognizing the set F((ab + ba)*).


1.3.4. Minimization algorithms 


A given regular language S ⊆ A* may be recognized by several different
automata. There is however a unique one with a minimal number of states, called
the minimal automaton of S. We will give a description of the minimal automaton
and several algorithms allowing one to compute it.

The abstract definition is quite simple: the states are the nonempty sets of
the form $x^{-1}S$ for x ∈ A* where

\[
x^{-1}S = \{ y \in A^* \mid xy \in S \}.
\]

The initial state is the set S itself (corresponding to x = ε) and the final states
are the sets $x^{-1}S$ with x ∈ S (or, equivalently, such that ε ∈ $x^{-1}S$).
There is a transition from the state $x^{-1}S$ by letter a ∈ A to the state
$(xa)^{-1}S$.


EXAMPLE 1.3.10. Let us consider the set $S_n$ of words over A = {a, b} that
have a symbol a at the (n + 1)th position before the end for some n ≥ 0.



Formally, $S_n = A^* a A^n$. For any $x = a_0 a_1 \cdots a_m \in A^*$, one has

$$x^{-1}S_n = S_n \cup \bigcup_{i \in P(x)} A^{n-i},$$

where $P(x) = \{\, i \mid 0 \le i \le n \text{ and } a_{m-i} = a \,\}$. Thus the minimal
automaton of $S_n$ has $2^{n+1}$ states, since its set of states is in bijection
with the set of all subsets of $\{0,1,\ldots,n\}$. The set $S_n$ is also recognized
by the nondeterministic automaton of Figure 1.18. This example shows that the
size of the minimal automaton can be exponential compared with the size of a
nondeterministic one.

Figure 1.18. Recognizing the words which have the letter a at the (n+1)-th
position before the end.
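The exponential growth can be checked experimentally. The sketch below assumes a particular numbering of the nondeterministic automaton (state 0 loops on both letters and moves to state 1 on a, then states 1, ..., n+1 form a chain, with n+1 terminal); Figure 1.18 may number its states differently. It counts the subsets reached by the subset construction.

```python
def count_subset_states(n):
    """Count the accessible subsets when determinizing an NFA for A* a A^n.

    Assumed NFA: state 0 loops on a and b, 0 --a--> 1, and i --a,b--> i+1
    for 1 <= i <= n; state n+1 is terminal and has no outgoing edge.
    """
    def step(subset, letter):
        nxt = set()
        for q in subset:
            if q == 0:
                nxt.add(0)
                if letter == "a":
                    nxt.add(1)
            elif q <= n:
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        subset = todo.pop()
        for letter in "ab":
            target = step(subset, letter)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)
```

Under these assumptions, every reached subset contains state 0 together with an arbitrary subset of {1, ..., n+1}, which accounts for the $2^{n+1}$ states.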


A general method for computing the minimal automaton consists of three
steps.

(i) Compute a nondeterministic automaton (e.g. by the method explained
in the next section).

(ii) Apply the determinization algorithm of Section 1.3.3 and remove all
states that are not accessible or coaccessible. The resulting automaton is
deterministic and trim.

(iii) Apply a minimization algorithm, as described below.

To minimize a deterministic automaton, one uses a sequence of refinements
of equivalence relations $\pi_0 > \pi_1 > \cdots > \pi_n$ in such a way that the
classes of $\pi_n$ are the states of the minimal automaton.

The equivalence relation $\pi_n$ is called the Nerode equivalence of the
automaton. It is characterized by

$$p \sim q \text{ if and only if } L_p = L_q,$$

where $L_p$ is the set of words recognized by the automaton with initial state $p$.

The sequence starts with the partition $\pi_0$ in two classes separating the
terminal states from the other ones. Further, one has $p \equiv q \bmod \pi_{k+1}$
if and only if

$$p \equiv q \bmod \pi_k \quad\text{and}\quad p \cdot a \equiv q \cdot a \bmod \pi_k \text{ for all } a \in A.$$

In the above condition, it is understood that $p \cdot a = \emptyset$ if and only
if $q \cdot a = \emptyset$. A partition of a set with $n$ elements can be simply
represented by a function assigning to each element $x$ its class $c(x)$.

The computation of the final partition is realized by the following algorithm 
known as Moore’s algorithm. 
MOOREMINIMIZATION()
    f ← INITIALPARTITION()
    do  e ← f
        ▷ e is the old partition, f is the new one
        f ← REFINE(e)
    while e ≠ f
    return e


The refinement is realized by the following function, in which we denote by
$a^{-1}e$ the equivalence defined by $p \equiv q \bmod a^{-1}e$ if and only if
$p \cdot a \equiv q \cdot a \bmod e$. Again, it is understood that $p \cdot a$ is
defined if and only if $q \cdot a$ is defined.


REFINE(e)
    for a ∈ A do
        g ← a⁻¹e
        e ← INTERSECTION(e, g)
    return e


The computation of the intersection of two equivalence relations on an
$n$-element set can be done in time $O(n^2)$ by brute force. A refinement using a
radix sort of the pairs of classes improves the running time to $O(n)$. Thus, the
function REFINE() runs in time $O(nk)$ on an automaton with $n$ states on an
alphabet with $k$ symbols. The loop in the function MOOREMINIMIZATION() is
executed at most $n$ times since the sequence of successive partitions is strictly
decreasing. Moore’s algorithm itself thus computes in time $O(n^2 k)$ the minimal
automaton equivalent to a given automaton with $n$ states and $k$ letters.
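As an illustration, here is a compact Python sketch of Moore's refinement loop. It is not the radix-sort implementation discussed above: the intersection with all the relations $a^{-1}e$ is performed in one pass by hashing, for each state, the tuple formed by its class and the classes of its successors. The function name and the dictionary-based representation of the automaton are assumptions of this sketch.

```python
def moore_minimize(states, alphabet, delta, terminals):
    """Return a dict mapping each state to the number of its Nerode class.

    `states` is a list, `delta` maps (state, letter) to a state (a missing
    key means that the transition is undefined), `terminals` is a set.
    """
    # Initial partition: terminal states versus the other ones.
    cls = {q: int(q in terminals) for q in states}
    while True:
        numbering = {}
        new_cls = {}
        for q in states:
            # Class of q followed by the classes of its successors
            # (None stands for an undefined transition).
            key = (cls[q],) + tuple(
                cls.get(delta.get((q, a))) for a in alphabet
            )
            new_cls[q] = numbering.setdefault(key, len(numbering))
        # The new partition refines the old one, so equal sizes mean equality.
        if len(numbering) == len(set(cls.values())):
            return new_cls
        cls = new_cls
```

Two states receive the same number in the result exactly when they are merged in the minimal automaton.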


EXAMPLE 1.3.11. Let us consider the set S = (a+bc+ab+c)*. A nondeter- 
ministic automaton recognizing S is represented on the left of Figure 1.19. The 
determinization algorithm produces the automaton on the right of the figure. 

Applying a renumbering of the states, we obtain the automaton on the
left of Figure 1.20. The minimization procedure starts with the partition
$e = \{1,3,4\}\{2\}$. Since $a^{-1}e = e$, the action of letter $a$ does not refine
$e$. On the contrary, $b^{-1}e = \{1,4\}\{2\}\{3\}$ and thus $e$ is refined to
$f = \{1,4\}\{2\}\{3\}$, which is found to be stable. Thus we obtain the minimal
automaton represented on the right of Figure 1.20.


There is a more complicated but more efficient algorithm, known as Hopcroft’s 
algorithm, which can be used to minimize deterministic automata. We assume 
that the automaton is complete. 

The idea is to replace the global operation of intersection of two partitions
by the refinement of a partition by a single block. Let $P$ be a set of states, and
let $a$ be a letter. Let $a^{-1}P = \{\, q \mid q \cdot a \in P \,\}$. A set $B$ of
states is refined into $B'$ and $B''$ by the pair $(P,a)$ if the sets
$B' = B \cap a^{-1}P$ and $B'' = B \setminus B'$ are both nonempty. Otherwise, $B$
is said to be stable by the pair $(P,a)$.
(a) A nondeterministic automaton. (b) The determinized version.


Figure 1.19. Recognizing the set (a + bc + ab + c)* 


(a) Renaming the states. (b) The minimal automaton. 


Figure 1.20. The minimization algorithm 


The algorithm starts with the partition composed of the set $T$ of terminal
states and its complement $T^c$. It maintains a set $S$ of pairs $(P,a)$ formed of a
set of states and a letter.

The main loop consists in selecting a pair $(P,a)$ from the set $S$. Then for
each block $B$ of the current partition which is refined by $(P,a)$ into $B'$, $B''$,
one performs the following steps:

1. replace $B$ by $B'$ and $B''$ in the current partition,
2. for each letter $b$,
   (a) if $(B,b)$ is in $S$, then replace $(B,b)$ by $(B',b)$ and $(B'',b)$ in $S$,
   (b) otherwise add to $S$ the pair $(C,b)$, where $C$ is the smaller of the sets
       $B'$ and $B''$.


If, instead of choosing the smaller of the sets B’ and B”, one adds both sets 
(B’, b) and (B”,b) to S, the algorithm becomes a complicated version of Moore’s 
algorithm. The reason why one may dispense with one of the two sets is that 
when a block B is stable by (P,a) and when P is partitioned into P’ and P”, 
then the refinement of B by (P’,a) is the same as the refinement by (P”,a). 
The choice of the smaller one is the essential ingredient in the improvement of
the time complexity from $O(n^2)$ to $O(n \log n)$.
This is described in Algorithm HOPCROFTMINIMIZATION() below.


HOPCROFTMINIMIZATION()
    e ← {T, Tᶜ}
    C ← the smaller of T and Tᶜ
    for a ∈ A do
        ADD((C, a), S)
    while S ≠ ∅ do
        (P, a) ← FIRST(S)
        for B ∈ e such that B is refined by (P, a) do
            B′, B″ ← REFINE(B, P, a)
            BREAKBLOCK(B, B′, B″, e)
                ▷ breaks B into B′, B″ in the partition e
            UPDATE(S, B, B′, B″)


where UPDATE() is the function that updates the set of pairs used to refine the 
partition, defined as follows. 


UPDATE(S, B, B′, B″)
    C ← the smaller of B′ and B″
    for b ∈ A do
        if (B, b) ∈ S then
            REPLACE((B, b), S, (B′, b), (B″, b))
        else ADD((C, b), S)


A careful implementation of the algorithm leads to a time complexity in 
O(kn log n) on an automaton with n states over k letters. One of the key points 
is the implementation of the function BREAKBLOCK(B, B’, B”,e) which has to 
be implemented so as to run in time O(Card(B)). The function actually replaces 
B by B\ B’ and adds a new block B’. For this, one traverses B (in linear time) 
and removes each element which is in B’ from B in constant time and adds it 
to the new block, also in constant time. 

The states of a class are represented by a doubly linked list, one list for
each class of the partition. This representation allows an element to be removed
from the list, and so also from the class, in constant time. An array of pointers
indexed by the states allows the location of a state in its block of the
partition to be retrieved in constant time.

(a) The classes and their size

    state   0  1  2  3  4  5
    class   1  2  0  2  0  2

    class   0  1  2
    card    2  1  3

(b) The blocks of the partition (arrays block and location pointing into the
    doubly linked lists; diagram not reproduced)

Figure 1.21. A partition of Q = {0,...,5}. The class of a state is the
integer in the array class. The size of a class is given in the array card.
The elements of a block are chained in a doubly linked list pointed to by
the entry in the array block. Each cell in these lists can be retrieved in
constant time by its state using the pointer in the array location.

In order to be able to check whether a block $B$ is refined by a pair $(P,a)$,
one maintains an array that counts, for each block $B$, the number of states of
$a^{-1}P$ that are found to be in $B$. The test whether $B$ is actually refined
consists in checking whether this number is both nonzero and strictly less than
$\operatorname{Card} B$. This requires maintaining a table containing the number of
elements of the blocks in the current partition.

To summarize, an arbitrary deterministic finite automaton with $n$ states over
$k$ letters can be minimized in time $O(kn \log n)$, that is, $O(n \log n)$ for a
fixed alphabet.
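The structure of Hopcroft's algorithm can be sketched in Python as follows. This is a simplified illustration using built-in sets: it scans the whole partition for each selected pair and copies blocks when splitting them, so it does not reach the $O(kn \log n)$ bound, which requires the linked-list and counter structures described above. The function name and the representation of the automaton are assumptions of this sketch; the automaton is assumed to be complete and to have at least one state.

```python
def hopcroft_minimize(states, alphabet, delta, terminals):
    """Return the set of Nerode classes (frozensets) of a complete DFA."""
    # inverse[a][q] is the set of states p such that p . a = q.
    inverse = {a: {q: set() for q in states} for a in alphabet}
    for (p, a), q in delta.items():
        inverse[a][q].add(p)

    terminals = frozenset(terminals)
    others = frozenset(states) - terminals
    partition = {block for block in (terminals, others) if block}
    smaller = min(partition, key=len)
    waiting = {(smaller, a) for a in alphabet}      # the set S of pairs

    while waiting:
        block_p, a = waiting.pop()
        # a^{-1}P: the states sent into P by the letter a.
        preimage = set()
        for q in block_p:
            preimage |= inverse[a][q]
        for block in list(partition):
            inside = block & preimage               # B'
            outside = block - preimage              # B''
            if not inside or not outside:
                continue                            # B is stable by (P, a)
            partition.remove(block)
            partition.update((inside, outside))
            for b in alphabet:
                if (block, b) in waiting:
                    waiting.remove((block, b))
                    waiting.update(((inside, b), (outside, b)))
                else:
                    waiting.add((min(inside, outside, key=len), b))
    return partition
```

The only place where the "smaller half" rule appears is the last line of the inner loop; replacing it by the addition of both halves gives the variant that behaves like Moore's algorithm.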

A trim automaton recognizing a finite set of words can be minimized in linear
time with respect to the size of the automaton. Let $\mathcal{A}$ be a finite
automaton with set of states $Q$ recognizing a finite set of words. Since the
automaton is trim, it is acyclic. Thus we are faced again with the DAWGs already
seen in Section 1.3.1.

The height $h(q)$ of a state $q$ is the length of the longest path in
$\mathcal{A}$ starting in $q$. Equivalently, it is the length of a longest word in
the language $L_q$ of words recognized by the automaton with initial state $q$. Of
course, for any edge $(p,a,q)$ one has $h(p) > h(q)$. Since the automaton is trim,
its initial state is the unique state of maximal height. The heights satisfy the
formula

$$h(p) = \begin{cases} 0 & \text{if } p \text{ has no outgoing edge,} \\ 1 + \max_{(p,a,q)} h(q) & \text{otherwise.} \end{cases}$$

In the second case, the maximum is taken over all edges starting in $p$. Observe
that this formula leads to an effective algorithm for computing heights because
the automaton has no cycle.
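The recursive formula for heights translates directly into a memoized function. The sketch below assumes that the automaton is given as a dictionary mapping every state to its list of outgoing edges (letter, target); this representation and the function name are illustrative choices. For very deep automata an iterative traversal would be preferable to recursion.

```python
from functools import lru_cache


def heights(edges):
    """Compute h(p) for every state p of an acyclic automaton.

    `edges` maps each state to a list of (letter, target) pairs.
    """
    @lru_cache(maxsize=None)
    def h(p):
        out = edges.get(p, [])
        # 0 without outgoing edge, else 1 + max over the successors.
        return 0 if not out else 1 + max(h(q) for _, q in out)

    return {p: h(p) for p in edges}
```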

The parameters in the algorithm are the number $n$ of states of $\mathcal{A}$,
the number $m$ of transitions, and the size $k$ of the underlying alphabet. Of
course, $m \le n \cdot k$. In practical situations like large dictionaries, the
number $m$ is much smaller than the product. As we will see, the minimization
algorithm can be implemented in time $O(n+m+k)$.

A word about the representation of $\mathcal{A}$. Since there are only few edges,
a convenient representation is to have, for each state $p$, a list of outgoing
edges, each represented by the pair $(a,q)$ such that $(p,a,q)$ is a transition.
States are numbered, so traversal, marking, copying and sorting are done on
integers. Also, terminal states are represented in such a way that one knows in
constant time whether a state is terminal.

It is easily seen that two states $q$ and $q'$ can be merged into a single state
in the minimal automaton only if they have the same height. Therefore, the
Nerode equivalence is a refinement of the partition into states of equal height.

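Recall that the Nerode equivalence is defined by

$$p \sim q \text{ if and only if } L_p = L_q.$$

Recall also that

$$p \sim q \text{ if and only if } (p \in T \Leftrightarrow q \in T) \text{ and } p \cdot a \sim q \cdot a \text{ for all } a \in A. \tag{1.3.2}$$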


This formula shows that if the equivalence is known for all states up to some
height $h-1$, it can be computed, by this formula, for states of height $h$. To
describe this in more detail, we associate, to each state $q$, a sequence of data
called its signature. It has the form

$$\sigma(q) = (s, a_1, \nu(q_1), a_2, \nu(q_2), \ldots, a_r, \nu(q_r)),$$

where $s = 0$ if $q$ is a nonterminal state and $s = 1$ if $q$ is a terminal state,
where $(q,a_1,q_1), \ldots, (q,a_r,q_r)$ are the edges starting in $q$, and where
$\nu(p)$ is the class of the state $p$. We consider that classes of states are
represented by integers. We assume moreover that $a_1, \ldots, a_r$ are in
increasing order. This can be realized by a bucket sort in time $O(n+m+k)$.

Then Equation 1.3.2 means that

$$p \sim q \text{ if and only if } \sigma(p) = \sigma(q).$$

Thus, a signature is a sequence of integers of length at most $1 + 2k$ (where
$k = \operatorname{Card} A$) and each element in this sequence has a value bounded
by $\max(2,k,n)$. Observe that the sum of the lengths of all signatures is bounded
by $2m + n$, where $m$ is the number of transitions. In fact, the signature of a
state $p$ is merely a representation of the transitions in the minimal automaton
starting in the state $\nu(p)$.

For computing the Nerode equivalence of the set $Q_h$ of states of height $h$,
one computes the set of signatures of states in $Q_h$. This set is sorted by a
radix sort according to their signatures, viewed as vectors over integers. Then
states with equal signatures are consecutive in the sorted list and the test
$\sigma(p) = \sigma(q)$ for equivalence can be done in linear time.

Here is the algorithm.

ACYCLICMINIMIZATION()
 1  ▷ ν[p] is the state corresponding to p in the minimal automaton
 2  (Q_0, ..., Q_H) ← PARTITIONBYHEIGHT(Q)
 3  for p in Q_0 do
 4      ν[p] ← 0
 5  k ← 1      ▷ class 0 is already used for the states of height 0
 6  for h ← 1 to H do
 7      S ← SIGNATURES(Q_h, ν)
 8      P ← RADIXSORT(Q_h, S)      ▷ P is the sorted sequence Q_h
 9      p ← first state in P
10      ν[p] ← k
11      k ← k + 1
12      for each q in P \ p in increasing order do
13          if σ(q) = σ(p) then
14              ν[q] ← ν[p]
15          else ν[q] ← k
16               (k, p) ← (k + 1, q)
17  return ν


A usual topological sort can implement PARTITIONBYHEIGHT(Q) in time 
O(n +m). 

Each signature is then computed in time proportional to its size, so the whole
set of signatures is computed in time $O(n+m)$. Each radix sort can be done
in time proportional to the sum of the sizes of the signatures, with an overhead
of one $O(k)$ initialization of the buckets. So the total time for the sort is also
$O(n+m+k)$.

Observe that the test at line 13 is linear in the length of the signatures, so 
the whole algorithm is in time O(k +n+m). 
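The signature method can be sketched in Python as follows. Instead of a radix sort per height, this illustration uses a dictionary from signatures to class numbers, which gives expected rather than worst-case linear behaviour; the function name and the edge-list representation (every state mapped to its list of (letter, target) pairs) are assumptions of the sketch.

```python
def acyclic_minimize(edges, terminals):
    """Return a dict nu mapping each state to its class number.

    `edges` maps every state to a list of (letter, target) pairs and the
    automaton is assumed to be acyclic.
    """
    height = {}

    def h(p):
        if p not in height:
            # max(...) is -1 for a state without outgoing edge, so h(p) = 0.
            height[p] = 1 + max((h(q) for _, q in edges[p]), default=-1)
        return height[p]

    for p in edges:
        h(p)

    nu = {}
    class_of_signature = {}
    # Process states by increasing height: successors are classified first.
    for p in sorted(edges, key=height.get):
        signature = (p in terminals,) + tuple(
            sorted((a, nu[q]) for a, q in edges[p])
        )
        nu[p] = class_of_signature.setdefault(
            signature, len(class_of_signature)
        )
    return nu
```

A single dictionary can serve for all heights because two states with the same signature are equivalent and therefore have the same height.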


EXAMPLE 1.3.12. Consider the automaton of Figure 1.22. The computation
of the heights gives the following partition:

$$Q_0 = \{4, 8\},\quad Q_1 = \{3, 7\},\quad Q_2 = \{2, 6, 10, 11\},\quad Q_3 = \{1, 5, 9\},\quad Q_4 = \{0\}.$$

States of height 0 are always final states, and are merged into a class numbered
0.
Figure 1.22. A trim automaton recognizing a finite set.

The states of height 1 have the signatures given below. Observe that in a
signature, the next state appearing in an edge is replaced by its class. This
can be done because the algorithm works by increasing height. These states are
merged into a class numbered 1.

(a) Signatures of states of height 1.

     3 : 0a0b0
     7 : 0a0b0

(b) The corresponding states of the minimal automaton.

    state   0  1  2  3  4  5  6  7  8  9 10 11
    ν       .  .  .  1  0  .  .  1  0  .  .  .

The radix sort of the four states of height 2 gives the sequence (10, 11, 2, 6),
so 10, 11 are grouped into a class 2 and 2, 6 are grouped into a class 3.

(c) Signatures of states of height 2.

     2 : 0a1b1
     6 : 0a1b1
    10 : 0a1b0
    11 : 0a1b0

(d) The corresponding states of the minimal automaton.

    state   0  1  2  3  4  5  6  7  8  9 10 11
    ν       .  .  3  1  0  .  3  1  0  .  2  2

The states of height 3 all give singleton classes, because the signatures are
different. This is already clear because they have distinct lengths. In other
words, a refinement of the algorithm could consist in partitioning the states of
the same height into subclasses according to their width, that is, the number of
edges starting in each state.

(e) Signatures of states of height 3.

    1 : 0a3b2
    5 : 0a3
    9 : 1a3b2c2

(f) The final state vector of the minimal automaton.

    state   0  1  2  3  4  5  6  7  8  9 10 11
    ν       7  5  3  1  0  4  3  1  0  6  2  2

Thus, the minimal automaton has 8 states. It is given in Figure 1.23.

Figure 1.23. The corresponding minimal automaton.


1.4. Pattern matching 


The specification of simple patterns on words uses the notion of a regular
expression. It is an expression built using letters and a symbol representing the
empty word, and three operators:

• union, denoted by the symbol ‘+’,
• product, denoted by mere concatenation,
• star, denoted by ‘*’.


These operators are used to denote the usual operations on sets of words. The
operations are the set union, the set product

$$XY = \{\, xy \mid x \in X,\ y \in Y \,\}$$

and the star operation

$$X^* = \{\, x_1 \cdots x_n \mid n \ge 0,\ x_1, \ldots, x_n \in X \,\}.$$

A regular expression $e$ defines a set of words $W(e)$ by using recursively the
operations of union, product and star:

$$W(e+f) = W(e) \cup W(f), \qquad W(ef) = W(e)W(f), \qquad W(e^*) = W(e)^*.$$

Words in $W(e)$ are said to match the expression $e$. The problem of checking
whether a word matches a regular expression is called a pattern matching
problem.
For instance, e = (a + b)*abaab(a + b)* is a regular expression. The set W(e)
is the set of words on A = {a,b} having abaab as a factor. More generally, 
for any word w, the words matching the regular expression A*wA* are those 
having w as a factor. Thus, the problem of checking whether a word is a factor 
of another is a particular case of a pattern matching problem. The same holds 
for subwords. 

For each regular expression $e$, there exists a finite automaton recognizing the
set of words $W(e)$. In other words, $W(e)$ is a regular set. A proof of this
assertion consists in an algorithm for building such a finite automaton,
inductively on the structure of the expression. Several constructions exist that
use slightly different normalizations of automata or of expressions. The main
variations concern the use of $\varepsilon$-transitions. We present below a
construction which makes extensive use of $\varepsilon$-transitions. The main
advantage is its simplicity, and the small size of the resulting automaton.

One starts with simple automata recognizing respectively the empty set, the
empty word $\varepsilon$, and a letter $a$, for any letter $a$. They are
represented in Figure 1.24. One further uses a recursive construction on automata
with three constructs implementing union, product and star. The construction is
indicated below.

(a) Empty set. (b) Empty word. (c) Letter a.

Figure 1.24. Automata for the empty set, for the empty word, and for a letter.

Figure 1.25. Automata for union, product and star.

This construction produces finite automata with several particular properties.
First, each state has at most two edges leaving it. If there are two edges, each
of them has an empty label. Also, there is a unique initial state $i$ and a unique
terminal state $t$. Finally, there is no edge entering $i$ and no edge leaving $t$.
We call such an automaton a pattern matching automaton.

We use a specific representation of nondeterministic automata tailored to the
particular automata constructed by the algorithm. A conversion to the
representation described above is straightforward. First, an automaton
$\mathcal{A}$ has a state INITIAL (the initial state) and a state TERMINAL (the
terminal state). Then, there are two functions NEXT1() and NEXT2(). For each state
$p$, NEXT1(p) = (a, q) if there is an edge $(p,a,q)$. If there is an edge
$(p,\varepsilon,q)$, then NEXT1(p) = (ε, q). If there is a second edge
$(p,\varepsilon,q')$, then NEXT2(p) = (ε, q′). If no edge starts from $p$, then
NEXT1(p) is undefined.

We use a function NEWAUTOMATON() to create an automaton with just 
one initial state and one terminal state and no edges. The function creating an 
automaton recognizing a is given in Algorithm AUTOMATONLETTER. 


AUTOMATONLETTER(a)
    𝒜 ← NEWAUTOMATON()
    NEXT1(INITIAL_𝒜) ← (a, TERMINAL_𝒜)
    return 𝒜


The automata recognizing the union, the product and the star are depicted
in Figure 1.25. Boxes represent automata, up to their initial and terminal states,
which are drawn separately. All drawn edges are ε-transitions. The
implementation of the corresponding three functions AUTOMATAUNION(),
AUTOMATAPRODUCT() and AUTOMATONSTAR() is straightforward.


AUTOMATAUNION(𝒜, ℬ)
    𝒞 ← NEWAUTOMATON()
    NEXT1(INITIAL_𝒞) ← (ε, INITIAL_𝒜)
    NEXT2(INITIAL_𝒞) ← (ε, INITIAL_ℬ)
    NEXT1(TERMINAL_𝒜) ← (ε, TERMINAL_𝒞)
    NEXT1(TERMINAL_ℬ) ← (ε, TERMINAL_𝒞)
    return 𝒞


The function AUTOMATAPRODUCT() uses a function MERGE() that merges 
two states into a single one. 


AUTOMATAPRODUCT(𝒜, ℬ)
    𝒞 ← NEWAUTOMATON()
    INITIAL_𝒞 ← INITIAL_𝒜
    TERMINAL_𝒞 ← TERMINAL_ℬ
    MERGE(TERMINAL_𝒜, INITIAL_ℬ)
    return 𝒞




AUTOMATONSTAR(𝒜)
    ℬ ← NEWAUTOMATON()
    NEXT1(INITIAL_ℬ) ← (ε, INITIAL_𝒜)
    NEXT2(INITIAL_ℬ) ← (ε, TERMINAL_ℬ)
    NEXT1(TERMINAL_𝒜) ← (ε, INITIAL_𝒜)
    NEXT2(TERMINAL_𝒜) ← (ε, TERMINAL_ℬ)
    return ℬ


The practical implementation of these algorithms on a regular expression
is postponed to the next section. As an example, consider the automaton in
Figure 1.26. It has 21 states and 27 edges.

Figure 1.26. The automaton for the expression (a + b)*b(a + 1)(a + b)*.

The size of the pattern matching automaton recognizing the set of words
matching a regular expression is linear in the size of the expression. Indeed,
denote by $n(e)$ the number of states of the pattern matching automaton
corresponding to the expression $e$. Then

$$\begin{aligned}
n(a) &= 2 \quad \text{for a letter } a \text{ (and likewise for the empty word)}, \\
n(e+f) &= n(e) + n(f) + 2, \\
n(ef) &= n(e) + n(f) - 1, \\
n(e^*) &= n(e) + 2.
\end{aligned}$$

Thus $n(e) \le 2|e|$, where $|e|$ is the length of the expression $e$ (discarding
the left and right parentheses). The number of edges is at most twice the number
of states. Thus the space complexity of the resulting algorithm is linear in the
size of the expression.

To realize the run of such an automaton on a word w, one uses Algo- 
rithm ISACCEPTED. We observe that in a pattern matching automaton, the 
out-degree of a state is at most 2. Therefore, the time complexity of a call 
IsACCEPTED(w) is O(nm), where n is the size of the regular expression and 
m= |w|. 
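The construction and the run of a pattern matching automaton can be sketched together in Python. The class below is an illustration only: its name, the use of integer states, and the representation of an automaton as a pair (initial, terminal) are assumptions, and the `accepts` method is a plain simulation with ε-closures rather than the text's Algorithm ISACCEPTED. The tables `next1` and `next2` play the role of NEXT1() and NEXT2().

```python
import itertools

EPS = ""  # label of an epsilon-transition


class Builder:
    def __init__(self):
        self.next1, self.next2 = {}, {}
        self._fresh = itertools.count()

    def new(self):
        # An automaton is the pair (initial state, terminal state).
        return next(self._fresh), next(self._fresh)

    def letter(self, a):
        i, t = self.new()
        self.next1[i] = (a, t)
        return i, t

    def union(self, x, y):
        i, t = self.new()
        self.next1[i], self.next2[i] = (EPS, x[0]), (EPS, y[0])
        self.next1[x[1]] = self.next1[y[1]] = (EPS, t)
        return i, t

    def product(self, x, y):
        # MERGE: the edges leaving y's initial state now leave x's terminal.
        for table in (self.next1, self.next2):
            if y[0] in table:
                table[x[1]] = table.pop(y[0])
        return x[0], y[1]

    def star(self, x):
        i, t = self.new()
        self.next1[i], self.next2[i] = (EPS, x[0]), (EPS, t)
        self.next1[x[1]], self.next2[x[1]] = (EPS, x[0]), (EPS, t)
        return i, t

    def edges(self, p):
        return [tab[p] for tab in (self.next1, self.next2) if p in tab]

    def reach(self, states, only_eps):
        seen, stack = set(states), list(states)
        while stack:
            for label, q in self.edges(stack.pop()):
                if (label == EPS or not only_eps) and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    def count_states(self, automaton):
        return len(self.reach({automaton[0]}, only_eps=False))

    def accepts(self, automaton, word):
        current = self.reach({automaton[0]}, only_eps=True)
        for c in word:
            step = {q for p in current
                    for label, q in self.edges(p) if label == c}
            current = self.reach(step, only_eps=True)
        return automaton[1] in current


def example(builder):
    """Automaton for (a + b)* b (a + 1) (a + b)*, with 1 the empty word."""
    def any_word():
        return builder.star(
            builder.union(builder.letter("a"), builder.letter("b")))
    e = builder.product(any_word(), builder.letter("b"))
    e = builder.product(
        e, builder.union(builder.letter("a"), builder.letter(EPS)))
    return builder.product(e, any_word())
```

Built this way, the automaton for the expression of Figure 1.26 has 21 states and 27 edges, in agreement with the recurrence for $n(e)$. Each step of `accepts` visits every state at most once, which is the source of the $O(nm)$ bound.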

In some particular cases, the quadratic complexity O(nm) can be replaced 
by O(n +m). This is the case in particular for the string matching problem 
treated in Algorithm SEARCHFACTOR. 
1.5. Transducers


Beyond formal languages, relations between words are a very natural concept. 
We consider relations over words, but most of the general notions work for 
relations over arbitrary sets. 

Formally, a relation $\rho$ between words over the alphabet $A$ and words over
the alphabet $B$ is just a subset of the Cartesian product $A^* \times B^*$. We
call it a relation from $A^*$ to $B^*$. Actually, such a relation can be viewed as
a partial function $f_\rho$ from $A^*$ to the set $\mathfrak{P}(B^*)$ of subsets of
$B^*$ defined by

$$f_\rho(x) = \{\, y \in B^* \mid (x,y) \in \rho \,\}, \qquad x \in A^*.$$


The inverse of a relation $\sigma$ from $A^*$ to $B^*$ is the relation
$\sigma^{-1}$ from $B^*$ to $A^*$ defined by

$$\sigma^{-1} = \{\, (v,u) \mid (u,v) \in \sigma \,\}.$$

The composition of a relation $\sigma$ from $A^*$ to $B^*$ and a relation $\tau$
from $B^*$ to $C^*$ is the relation $\sigma \circ \tau$ from $A^*$ to $C^*$ defined
by $(x,z) \in \sigma \circ \tau$ if and only if there exists $y \in B^*$ such that
$(x,y) \in \sigma$ and $(y,z) \in \tau$. The reader should be aware that the
composition of relations goes the other way round than the usual composition of
functions. The function $f_{\sigma \circ \tau}$ defined by the relation
$\sigma \circ \tau$ is $f_{\sigma \circ \tau}(x) = f_\tau(f_\sigma(x))$.
One can overcome this unpleasant aspect by writing the function symbol on the
right of the argument.
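The order of composition can be made concrete on finite relations. The following sketch assumes that relations are truncated to finite sets of pairs of Python strings (relations on words are infinite in general); all names are illustrative.

```python
from itertools import product


def inverse(rel):
    """The inverse relation: every pair is swapped."""
    return {(v, u) for (u, v) in rel}


def compose(sigma, tau):
    """(x, z) is kept iff some y has (x, y) in sigma and (y, z) in tau."""
    return {(x, z) for (x, y) in sigma for (y2, z) in tau if y == y2}


def image(rel, x):
    """The set f_rel(x) of words related to x."""
    return {y for (u, y) in rel if u == x}


def words(alphabet, max_len):
    """All words over `alphabet` of length at most `max_len`."""
    return ["".join(t)
            for n in range(max_len + 1)
            for t in product(alphabet, repeat=n)]
```

Note that `compose(sigma, tau)` applies `sigma` first and `tau` second, matching the convention of the text.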

A particular case of a relation $\rho$ from $A^*$ to $B^*$ is that of a partial
function from $A^*$ to $B^*$. In this case, $f_\rho$ is a (partial) function from
$A^*$ into $B^*$.


EXAMPLE 1.5.1. Consider the relation $\gamma \subseteq A^* \times A^*$ defined by
$(x,y) \in \gamma$ if and only if $x$ and $y$ are conjugate. Clearly,
$\gamma = \gamma^{-1}$. The image of a word $x$ is the set of conjugates of $x$.


EXAMPLE 1.5.2. Consider the relation $\mu \subseteq A^* \times A^*$ defined by

$$\mu = \{\, (a_1 a_2 \cdots a_n,\ a_n a_{n-1} \cdots a_1) \mid a_1, \ldots, a_n \in A \,\}.$$

Clearly, $\mu = \mu^{-1}$ and $\mu \circ \mu$ is the identity relation.


EXAMPLE 1.5.3. For the relation $\rho \subseteq A^* \times A^*$ defined by
doubling each letter,

$$\rho = \{\, (a_1 a_2 \cdots a_n,\ a_1^2 a_2^2 \cdots a_n^2) \mid a_1, \ldots, a_n \in A \,\},$$

the image of a word $x = a_1 a_2 \cdots a_n$ is $a_1^2 a_2^2 \cdots a_n^2$. The
inverse is only defined on words of the form $a_1^2 a_2^2 \cdots a_n^2$.


The set of relations on words is subject to several additional operations. The
union of two relations $\rho, \sigma \subseteq A^* \times B^*$ is the set union
$\rho \cup \sigma$. The product of $\rho$ and $\sigma \subseteq A^* \times B^*$ is
the relation

$$\rho\sigma = \{\, (ur, vs) \mid (u,v) \in \rho,\ (r,s) \in \sigma \,\}.$$

The star of $\sigma \subseteq A^* \times B^*$ is the relation

$$\sigma^* = \{\, (u_1 u_2 \cdots u_n,\ v_1 v_2 \cdots v_n) \mid (u_i, v_i) \in \sigma,\ n \ge 0 \,\}.$$

A relation from $A^*$ to $B^*$ is rational if it can be obtained from subsets of
$(A \cup \{\varepsilon\}) \times (B \cup \{\varepsilon\})$ by a finite number of
operations of union, product and star.

A rational relation that is a (partial) function is called a rational function. 


EXAMPLE 1.5.4. The doubling relation is rational since it can be written, e.g.
on the alphabet $\{a,b\}$, as $((a,aa) \cup (b,bb))^*$. More generally, for any
morphism $f$ from $A^*$ to $B^*$, the relation
$\rho = \{\, (x, f(x)) \mid x \in A^* \,\}$ is rational. Indeed,
$\rho = \bigl(\bigcup_{a \in A} (a, f(a))\bigr)^*$. Thus morphisms are rational
functions.


Just as regular expressions correspond to automata, rational relations
correspond to a kind of automata called transducers, which are just automata with
output. Formally, a transducer over the alphabets $A$, $B$ is an automaton in
which the edges are elements of $Q \times A^* \times B^* \times Q$. Thus each edge
$(p,u,v,q)$ has an input label $u$ which is a word over the input alphabet $A$ and
an output label $v$ which is a word over the output alphabet $B$. The transducer is
denoted $(Q,E,I,T)$ where $Q$ is the set of states, $E$ the set of edges, $I$ the
set of initial states and $T$ the set of final states.

There are two “ordinary” automata corresponding to a given transducer. 
The input automaton is obtained by using only the input label of each edge. 
The output automaton is obtained by using only the output labels. 

The terminology introduced for automata extends naturally to transducers.
In particular, a path is labeled by a pair $(x,y)$ formed of its input label $x$
and its output label $y$. Such a path from $p$ to $q$ is often denoted
$p \xrightarrow{x|y} q$. Just as a finite automaton recognizes a set of words, a
transducer recognizes or realizes a relation. The algorithm of Section 1.4 can be
easily adapted to build a transducer corresponding to a given rational relation.

As for automata, we allow in the definition of transducers the input and 
output labels to be arbitrary, possibly empty, words. The behavior of the trans- 
ducer can be viewed as a machine reading an input word and writing an output 
word through two “heads” (see Figure 1.27). The mechanism is asynchronous 
in the sense that the two heads may move at different speeds. 

The particular case of synchronous transducers is important. A transducer is 
said to be synchronous if for each edge, the input label and the output label are 
letters. Not every rational relation can be realized by a synchronous transducer. 
Indeed, if $\rho$ is realized by a synchronous transducer, then $\rho$ is
length-preserving. This means that whenever $(x,y) \in \rho$, then $|x| = |y|$.

A transducer is literal if for each edge the input label and the output label 
are letters or the empty word. It is not difficult to show that any transducer 
can be replaced by a literal one. 


EXAMPLE 1.5.5. The relation between a word written in lower-case letters
a, b, c, ... and the corresponding upper-case letters A, B, C, ... is rational.
Indeed, it is described by the expression $((a,A) \cup (b,B) \cup \cdots)^*$. This
relation is realized by the transducer of Figure 1.28. This transducer is both
literal and synchronous.

Figure 1.27. A transducer reads the input and writes the output.

Figure 1.28. From lower case to upper case.


EXAMPLE 1.5.6. The Fibonacci morphism defined by $a \to ab$, $b \to a$ is realized
by the transducer on the left of Figure 1.29. The transducer on the right of
Figure 1.29 realizes the same morphism. It is literal.

Figure 1.29. The Fibonacci morphism.


EXAMPLE 1.5.7. The transducer represented on the left of Figure 1.30 realizes
the circular right shift on a word on the alphabet $\{a,b\}$ ending with letter
$a$. The transformation consists in shifting cyclically each symbol one place to
the right. For example, the word abbaba is transformed into aabbab. The
restriction to words ending with letter $a$ is for simplicity (and corresponds
to the choice of state 0 as initial and final state in the automaton on the left
of Figure 1.30). The inverse of the right shift is the left shift which shifts
all symbols cyclically one place to the left. Its restriction to words beginning
with $a$ is represented on the right of Figure 1.30. The composition of both
transformations is the identity restricted to words ending with the letter $a$
plus the empty word.

Figure 1.30. The circular right shift on words ending with a and its inverse.


An important property of rational relations is that the composition of two
rational relations is again a rational relation. The construction of a transducer
realizing the composition is the following. We start with a transducer
$\mathcal{S} = (Q,E,I,T)$ over $A$, $B$ and a transducer
$\mathcal{S}' = (Q',E',I',T')$ over $B$, $C$. We suppose that $\mathcal{S}$ and
$\mathcal{S}'$ are literal (actually we shall only need that the output automaton
of $\mathcal{S}$ is literal and that the input automaton of $\mathcal{S}'$ is
literal). We build a new transducer $\mathcal{U}$ as follows. The set of states of
$\mathcal{U}$ is $Q \times Q'$. The set of edges is formed of three kinds of edges:

1. the set of edges $(p,p') \xrightarrow{a|c} (q,q')$ for all edges
   $p \xrightarrow{a|b} q$ in $E$ and $p' \xrightarrow{b|c} q'$ in $E'$, where $b$
   is a letter of $B$;

2. the set of edges $(p,p') \xrightarrow{\varepsilon|c} (p,q')$ for
   $p' \xrightarrow{\varepsilon|c} q'$ in $E'$;

3. the set of edges $(p,p') \xrightarrow{a|\varepsilon} (q,p')$ for
   $p \xrightarrow{a|\varepsilon} q$ in $E$.

The set of initial states of $\mathcal{U}$ is $I \times I'$ and the set of terminal
states is $T \times T'$. The definition of the edges implies that

$$(p,r) \xrightarrow{x|z} (q,s) \iff \exists y : p \xrightarrow{x|y} q \text{ and } r \xrightarrow{y|z} s.$$

This allows one to prove that the composed transducer realizes the composition of
the relations.


EXAMPLE 1.5.8. The composition of the circular right shift of Example 1.5.7 
with itself produces the circular right 2-shift which consists in cyclically shifting 
the letters two places to the right for words ending with aa. 


For the implementation of transducers we use a function NEXT(p) which asso- 
ciates to each state p the set of edges beginning at p, and two sets INITIAL and 
TERMINAL to represent the initial and terminal states. 

The algorithm computing the composition of two transducers is easy to write. 
Figure 1.31. The right 2-shift.

COMPOSETRANSDUCERS(𝒮, 𝒯)
    ▷ 𝒮 and 𝒯 are literal transducers
    𝒰 ← NEWTRANSDUCER()
    for each edge (p, a, b, q) of 𝒮 do
        for each edge (r, b, c, s) of 𝒯 do
            add ((p, r), a, c, (q, s)) to the edges of 𝒰
    for each edge (p, a, ε, q) of 𝒮 do
        for each state r of 𝒯 do
            add ((p, r), a, ε, (q, r)) to the edges of 𝒰
    for each edge (r, ε, c, s) of 𝒯 do
        for each state p of 𝒮 do
            add ((p, r), ε, c, (p, s)) to the edges of 𝒰
    INITIAL_𝒰 ← INITIAL_𝒮 × INITIAL_𝒯
    TERMINAL_𝒰 ← TERMINAL_𝒮 × TERMINAL_𝒯
    return 𝒰
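A Python sketch of this composition follows. Edges are tuples (p, input, output, q) and the empty string stands for ε; these conventions, the function names, and the two sample transducers `FIB` (a literal transducer for the Fibonacci morphism, which may differ from the one drawn in Figure 1.29) and `UPPER` are assumptions made for the illustration. The helper `images` enumerates the outputs on a given word by graph search and assumes that the transducer has no cycle of edges with empty input.

```python
from collections import deque


def compose(s_edges, t_edges, s_states, t_states):
    """Edges of the composition of two literal transducers S and T."""
    edges = set()
    for p, a, b, q in s_edges:
        if b == "":
            # S writes nothing: T stays in its state.
            edges.update(((p, r), a, "", (q, r)) for r in t_states)
        else:
            # The letter b written by S is read by T.
            edges.update(((p, r), a, c, (q, s))
                         for r, b2, c, s in t_edges if b2 == b)
    for r, b, c, s in t_edges:
        if b == "":
            # T reads nothing: S stays in its state.
            edges.update(((p, r), "", c, (p, s)) for p in s_states)
    return edges


def images(edges, initials, terminals, word):
    """All outputs produced on `word` along successful paths."""
    results, seen = set(), set()
    queue = deque((i, 0, "") for i in initials)
    while queue:
        config = queue.popleft()
        if config in seen:
            continue
        seen.add(config)
        state, pos, out = config
        if pos == len(word) and state in terminals:
            results.add(out)
        for p, a, b, q in edges:
            if p != state:
                continue
            if a == "":
                queue.append((q, pos, out + b))
            elif pos < len(word) and word[pos] == a:
                queue.append((q, pos + 1, out + b))
    return results


# Hypothetical literal transducer for a -> ab, b -> a (state 0 initial, final).
FIB = {(0, "a", "a", 1), (1, "", "b", 0), (0, "b", "a", 0)}
# Lower case to upper case on {a, b}, with a single state 0.
UPPER = {(0, "a", "A", 0), (0, "b", "B", 0)}
```

For instance, composing `FIB` with itself and running the result on ab yields abaab, the image of ab under the square of the morphism.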


The composition can be used to compute an automaton that recognizes the
image of a word (and more generally of a regular set) by a rational relation.
Indeed, let $\rho$ be a rational relation from $A^*$ to $B^*$, and let $x$ be a
word over $A$. Let $\mathcal{R}$ be a literal transducer realizing $\rho$, and let
$\mathcal{A}$ be a literal transducer realizing the relation
$\{(\varepsilon, x)\}$. Let $\mathcal{T} = \text{COMPOSE}(\mathcal{A}, \mathcal{R})$
be the composition of $\mathcal{A}$ and $\mathcal{R}$. The image
$f_\rho(x) = \{\, y \in B^* \mid (x,y) \in \rho \,\}$ is recognized by the output
automaton of $\mathcal{T}$.


EXAMPLE 1.5.9. Consider the word $x = ab$ and the Fibonacci morphism of
Example 1.5.6. On the left of Figure 1.32 is a transducer realizing
$\{(\varepsilon, x)\}$, and on the right the transducer obtained by composing it
with the literal transducer of Figure 1.29. The composition of the transducers
actually contains an additional edge $(0,1) \to (0,0)$, which is useless because
the state $(0,1)$ is inaccessible from the initial state.


Figure 1.32. The image f(x) = aba of x = ab by the Fibonacci morphism.

A sequential transducer over $A$, $B$ is a triple $(Q,i,T)$ together with a
partial function

$$Q \times A \to B^* \times Q$$

which breaks up into a next state function $Q \times A \to Q$ and an output
function $Q \times A \to B^*$. As usual, the next state function is denoted
$(q,a) \mapsto q \cdot a$ and the output function $(q,a) \mapsto q * a$. In
addition, the initial state $i \in Q$ has attached a word $\lambda$ called the
initial prefix and $T$ is actually a (partial) function $T : Q \to B^*$ called the
terminal function. Thus, an initial prefix and an additional suffix can be added
to all outputs.

The next state and the output functions are extended to words by
$p \cdot (xa) = (p \cdot x) \cdot a$ and $p * (xa) = (p * x)\,((p \cdot x) * a)$.
The second formula means that the output $p * (xa)$ is actually the product of the
words $p * x$ and $q * a$ where $q = p \cdot x$. The (partial) function $f$ from
$A^*$ to $B^*$ realized by the sequential transducer is defined by
$f(x) = \lambda v \tau$ where $\lambda$ is the initial prefix, $v = i * x$ and
$\tau = T(i \cdot x)$. A function from $A^*$ to $B^*$ that is realized by a
sequential transducer is called a sequential function.


EXAMPLE 1.5.10. The circular left shift on words over {a, b} beginning with a
is realized, on the right of Figure 1.30, by a transducer which is not sequential
(two edges with input label a leave state 0). It can also be computed by the
sequential transducer of Figure 1.33 with the initial pair (ε, 0) and the terminal
function T(1) = a.


Figure 1.33. A sequential transducer for the circular left shift on words 
beginning with a. 
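
The definition of a sequential function can be sketched in Python as follows.
This is only an illustration: the class name, the dictionary encoding of the
partial functions, and the edge set chosen for the circular left shift are
assumptions of this sketch (Figure 1.33 itself is not reproduced here); only the
initial pair (ε, 0) and the terminal value T(1) = a come from Example 1.5.10.

```python
class SequentialTransducer:
    """Sequential transducer with an initial prefix and a terminal function."""

    def __init__(self, initial, next_table, terminal):
        self.prefix, self.start = initial   # INITIAL = (initial prefix, initial state)
        self.next_table = next_table        # NEXT(p, a) = (p * a, p . a), a partial function
        self.terminal = terminal            # TERMINAL(q), defined on terminal states only

    def apply(self, word):
        """Return the image of word, or None when it is undefined."""
        state, pieces = self.start, [self.prefix]
        for letter in word:
            if (state, letter) not in self.next_table:
                return None
            output, state = self.next_table[(state, letter)]
            pieces.append(output)
        if state not in self.terminal:
            return None
        pieces.append(self.terminal[state])
        return "".join(pieces)


# Assumed edge set for the circular left shift on words beginning with a.
left_shift = SequentialTransducer(
    initial=("", 0),
    next_table={(0, "a"): ("", 1), (1, "a"): ("a", 1), (1, "b"): ("b", 1)},
    terminal={1: "a"},
)
print(left_shift.apply("abba"))  # bbaa
print(left_shift.apply("ba"))    # None
```

The first letter a is swallowed by the edge leaving state 0 and restored at the
end by the terminal function, which is why no guessing is needed.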


The composition of two sequential functions is again a sequential function. This
is actually a particular case of the composition of rational functions. The same
construction is used to compose sequential transducers and it happens to
produce a sequential transducer. We give explicitly the form of the composed
transducer.

Let G = (Q, i, T) be a sequential transducer over A, B and let
G′ = (Q′, i′, T′) be a sequential transducer over B, C. The composition of G and
G′ is the sequential transducer G ∘ G′ with set of states Q′ × Q, initial state
(i′, i) and terminal states T″ = T′ × T. Observe that we reverse the order for
notational convenience. The next state function and the output function are
given by

    (p′, p) · x = (p′ · (p ∗ x), p · x)
    (p′, p) ∗ x = p′ ∗ (p ∗ x)

The initial prefix of the composed transducer is the word λ″ = λ′(i′ ∗ λ), and
the terminal function T″ is defined by

    T″(q′, q) = (q′ ∗ T(q)) T′(q′ · T(q)).

The value of the terminal function T″ on (q′, q) is indeed obtained by first
computing the value of the terminal function T(q) and then fitting this word in
the transducer G′ at state q′.

For the implementation of sequential transducers we use a partial function
NEXT(p, a) = (p ∗ a, p · a) grouping the output function and the next state
function. There is also a pair INITIAL = (λ, i) ∈ B* × Q for the initial prefix
and the initial state and a partial function TERMINAL(q) returning the terminal
suffix for each terminal state q ∈ T.


1.5.1. Determinization of transducers 


Contrary to ordinary automata, it is not true that any finite transducer is
equivalent to a finite sequential one. It can be verified that a transducer is
equivalent to a sequential one if and only if it realizes a partial function and if
it satisfies a condition called the twinning property defined as follows. Consider
a pair of paths with the same input label and of the form

    i --u|u′--> q --v|v′--> q     and     i′ --u|u″--> q′ --v|v″--> q′

where i and i′ are initial states. Two paths as above are called twin. The
twinning property is that for any pair of twin paths, the outputs are such that
v′, v″ are conjugate and u′v′v′··· = u″v″v″···.

EXAMPLE 1.5.11. The circular right shift on all words over {a, b} is realized
by the transducer of Figure 1.34. It is not a sequential function because the last
letter cannot be guessed before the end of the input. Formally, this is visible
because of the twin paths

    0 --b|a--> 1 --ab|ba--> 1     and     0 --b|b--> 3 --ab|ba--> 3

with distinct outputs ababa··· and bbababa···.


Figure 1.34. The circular right shift.


The computation of an equivalent sequential transducer is a variant of the
determinization algorithm of automata. The main difference is that it may fail
to terminate since, as we have seen before, it cannot always be performed
successfully. We start with a transducer 𝒜 which is supposed to be equivalent to
a sequential one. We suppose that 𝒜 is literal (or, at least, that its input
automaton is literal) and trim. The states of the equivalent sequential transducer
ℬ are sets of pairs (u, q) ∈ B* × Q. A pair (u, q) ∈ B* × Q is called a half-edge.
The states are computed by using in a first step a function NEXT() represented
below. The value of NEXT(S, a) on a set S of half-edges and a letter a is the
union, for (u, p) ∈ S, of the sets of half-edges (uvw, r) such that there are, in 𝒜,

(i) an edge p --a|v--> q,

(ii) and a path q --ε|w--> r.

We use a function NEXT_𝒜(p, a) returning the set of half-edges (v, q) such that
(p, a, v, q) is an edge of the transducer 𝒜.


NEXT(S, a)
1  ▷ S is a set of half-edges (u, q) ∈ B* × Q, and a is a letter
2  T ← ∅
3  for (u, p) ∈ S do
4      for (v, q) ∈ NEXT_𝒜(p, a) do
5          T ← T ∪ {(uv, q)}
6  return CLOSURE(T)


The set CLOSURE(T) is the set of half-edges (uw, r) such that there is a path
q --ε|w--> r in 𝒜 for some half-edge (u, q) ∈ T. If the transducer is equivalent
to a deterministic one, this set is finite. The computation of CLOSURE(T) uses as
usual an exploration of the graph composed of the edges of the form (q, ε, v, r).
A test can be added to check that this graph has no loop whose label is a
nonempty word, i.e. that the set CLOSURE(T) is finite.

As an auxiliary step, we compute the following function.

LCP(U)
1  ▷ U is a set of half-edges
2  v ← LONGESTCOMMONPREFIX(U)
3  U′ ← ERASE(v, U)
4  return (v, U′)

The function LONGESTCOMMONPREFIX(U) returns the longest common
prefix of the words u such that there is a pair (u, q) ∈ U. The function
ERASE(v, U) returns the set of half-edges obtained by erasing the prefix v of
the words u appearing in the half-edges (u, q) ∈ U.
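
As a small illustration, LCP() can be sketched in Python. The representation of
a half-edge as a pair (string, state) and the use of a frozenset for the result
are assumptions of this sketch; os.path.commonprefix is used only because it
computes a character-wise common prefix of a list of strings.

```python
import os


def lcp(half_edges):
    """Return (v, U') where v is the longest common prefix of the output words
    of the half-edges and U' is the set obtained by erasing v from each of them."""
    words = [u for (u, _) in half_edges]
    v = os.path.commonprefix(words)      # "" when the set is empty
    erased = frozenset((u[len(v):], q) for (u, q) in half_edges)
    return v, erased


print(lcp({("ab", 1), ("abb", 2)}) == ("ab", frozenset({("", 1), ("b", 2)})))  # True
```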

In a second step, we build the set of states and the next state function
of the resulting sequential transducer ℬ. As for automata, we use a function
EXPLORE() which operates on the fly.

EXPLORE(𝒯, S, ℬ)
1  ▷ 𝒯 is a collection of sets of half-edges
2  ▷ S is an element of 𝒯
3  for each letter a do
4      (v, U) ← LCP(NEXT(S, a))
5      NEXT_ℬ(S, a) ← (v, U)
6      if U ≠ ∅ and U ∉ 𝒯 then
7          𝒯 ← 𝒯 ∪ {U}
8          (𝒯, ℬ) ← EXPLORE(𝒯, U, ℬ)
9  return (𝒯, ℬ)


We can finally write the function realizing the determinization of a
transducer into a sequential one.

TOSEQUENTIALTRANSDUCER(𝒜)
 1  ▷ 𝒜 is a transducer
 2  ℬ ← NEWSEQUENTIALTRANSDUCER()
 3  I ← CLOSURE({ε} × INITIAL_𝒜)
 4  INITIAL_ℬ ← I
 5  ▷ 𝒯 is a collection of sets of half-edges
 6  𝒯 ← {I}
 7  (𝒯, ℬ) ← EXPLORE(𝒯, I, ℬ)
 8  for S ∈ 𝒯 do
 9      for (u, q) ∈ S do
10          if q ∈ TERMINAL_𝒜 then
11              TERMINAL_ℬ(S) ← u
12  return ℬ


EXAMPLE 1.5.12. The application of the determinization algorithm to the
transducer on the right of Figure 1.30 produces the sequential transducer of
Figure 1.33, as obtained on Figure 1.35.

Figure 1.35. A sequential transducer for the circular left shift on words
beginning with a obtained by the determinization algorithm.
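
For illustration only, the functions NEXT(), CLOSURE(), LCP() and EXPLORE() can
be sketched in Python as follows. The encoding of the transducer by the
dictionaries `edges` and `eps_edges`, the worklist that replaces the recursion
of EXPLORE(), and the example itself are assumptions of this sketch: the example
is a hypothetical literal transducer for the circular left shift, not
necessarily the one of Figure 1.30. The sketch takes the initial prefix to be
empty and performs no failure test, so it may not terminate on a transducer
that has no sequential equivalent.

```python
import os


def closure(half_edges, eps_edges):
    """Add every (u + w, r) reached from (u, q) by edges with empty input."""
    seen, stack = set(half_edges), list(half_edges)
    while stack:
        u, q = stack.pop()
        for w, r in eps_edges.get(q, ()):
            if (u + w, r) not in seen:
                seen.add((u + w, r))
                stack.append((u + w, r))
    return frozenset(seen)


def lcp(half_edges):
    v = os.path.commonprefix([u for u, _ in half_edges])
    return v, frozenset((u[len(v):], q) for u, q in half_edges)


def to_sequential(alphabet, edges, eps_edges, initial, terminal):
    """edges[(p, a)] and eps_edges[p] are lists of half-edges (output, target)."""
    def next_set(S, a):
        T = {(u + v, q) for u, p in S for v, q in edges.get((p, a), ())}
        return closure(T, eps_edges)

    start = closure({("", i) for i in initial}, eps_edges)
    states, next_table, todo = {start}, {}, [start]
    while todo:                              # iterative version of EXPLORE
        S = todo.pop()
        for a in alphabet:
            v, U = lcp(next_set(S, a))
            if not U:
                continue                     # NEXT is left undefined on (S, a)
            next_table[(S, a)] = (v, U)
            if U not in states:
                states.add(U)
                todo.append(U)
    terminal_fn = {S: u for S in states for u, q in S if q in terminal}
    return start, next_table, terminal_fn


# Hypothetical literal transducer: state 0 reads the first a, state 1 copies
# the rest, and an edge with empty input from 1 to 2 outputs the final a.
edges = {(0, "a"): [("", 1)], (1, "a"): [("a", 1)], (1, "b"): [("b", 1)]}
eps_edges = {1: [("a", 2)]}
start, next_table, terminal_fn = to_sequential("ab", edges, eps_edges, {0}, {2})
U = frozenset({("", 1), ("a", 2)})
print(next_table[(start, "a")] == ("", U), terminal_fn == {U: "a"})  # True True
```

The resulting transducer has two states, the initial one and U, with loops a|a
and b|b on U and terminal value a, in agreement with the description given in
Example 1.5.10.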


A test can be added to the determinization algorithm to stop the computation
in case of failure, that is if one of the following situations occurs, implying
that the transducer 𝒜 is not equivalent to a sequential one. First, one may check
at line 4 in algorithm EXPLORE() that the half-edges appearing in a state of
ℬ have a label of bounded length. Indeed, it can be shown that there exists a
constant K, depending on 𝒜, such that for each half-edge (u, q) appearing in a
state of ℬ, the length of u is bounded by K (otherwise 𝒜 does not satisfy the
twinning property, see Problem 1.5.1). Second, a test can be added at line 10 of
algorithm TOSEQUENTIALTRANSDUCER() to check that if a state of ℬ contains
two half-edges (u, q) and (v, r) with q, r terminal, then u = v (if this condition
fails to hold, then 𝒜 does not realize a function).


1.5.2. Minimization of transducers 


Just as there is a unique minimal deterministic automaton equivalent to a given
one, there is also a unique minimal sequential transducer equivalent to a given
one. The minimization of sequential transducers consists of two steps. A
preliminary one, called normalization, makes the transducer produce its output
as soon as possible. The second step is quite similar to the minimization of
finite automata.

Let 𝒜 = (Q, i, T) be a sequential transducer. For each state p ∈ Q, let us
denote by L_p the subset of B* recognized by the output automaton
corresponding to 𝒜 with p as initial state. The normalization consists in
computing, for each state p ∈ Q, the longest common prefix π_p of all words in L_p.

The normalized transducer is obtained by modifying the output function
and terminal function of 𝒜. We set

    λ′ = λ π_i,    p ∗′ a = π_p^{-1} (p ∗ a) π_{p·a},    T′(p) = π_p^{-1} T(p).


The computation of the words π_p can be performed as follows. It uses the
binary operation associating to two words their longest common prefix. This
operation is associative and commutative and will be denoted in this section by
a +, like a sum. We consider the set K = B* ∪ {0} formed of B* augmented with
0 as ordered by the relation x ≤ y if x is a prefix of y or y = 0. For p, q ∈ Q,
we denote by M_{p,q} the element of K which is the longest common prefix of all
words v such that there is an edge p --a|v--> q (and M_{p,q} = 0 if this set is
empty). We also consider the Q-vector N defined by N_p = T(p), where T is the
terminal function, and N_p = 0 if T(p) is empty. For a Q-vector X of elements
of K, we consider the vector Y = MX + N which is defined for p ∈ Q by

    Y_p = Σ_{q ∈ Q} M_{p,q} X_q + N_p

Recall that all sums are in fact longest common prefixes and that the
right-hand side of the equation above is the longest common prefix of the words
M_{p,q} X_q, for q ∈ Q, and N_p. It can be checked that the function f defined by
f(X) = MX + N is order preserving for the partial order considered on the set
K. Thus, there is a unique maximal fixpoint which satisfies X = MX + N.
This is precisely the vector of words P = (π_p) we are looking for. It can be
computed as the limit of the decreasing sequence f^k(0) for k = 1, 2, ....


EXAMPLE 1.5.13. Consider the transducer realizing the Fibonacci morphism
represented on the right of Figure 1.29. The determinization of this transducer
produces the sequential transducer on the left of Figure 1.36. The computation
of the vector P uses the transformation Y = MX + N with

    Y_0 = aX_1 + aX_0 + ε
    Y_1 = baX_0 + baX_1 + b

The successive values of the vector P are P = [0 0], P = [ε b]. The last value
satisfies P = MP + N and thus it is the final one. The normalized transducer
is shown on the right of Figure 1.36.

Figure 1.36. The normalization algorithm.


The algorithm to compute the array P is easy to write.

LONGESTCOMMONPREFIXARRAY(𝒜)
1  ▷ P, P′ are arrays of strings initially null
2  ▷ M is the matrix of transitions of 𝒜 and N the vector of terminals
3  do  P ← P′
4      P′ ← MP + N
5  while P ≠ P′
6  return P

The expression MP + N should be evaluated using the longest common prefix
for the sum, including those appearing in the product MP. The normalized
transducer can now be computed by the following function.


NORMALIZETRANSDUCER(𝒜)
1  P ← LONGESTCOMMONPREFIXARRAY(𝒜)
2  (λ, i) ← INITIAL
3  INITIAL ← (λP[i], i)
4  for (p, a) ∈ Q × A do
5      (u, q) ← NEXT(p, a)
6      NEXT(p, a) ← (P[p]⁻¹ u P[q], q)
7  for p ∈ Q do
8      T[p] ← P[p]⁻¹ T[p]
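
The two functions above can be sketched in Python as follows. This is an
illustration under stated assumptions: the element 0 of K is encoded as None,
the transducer is given by dictionaries, it is assumed to be trim, and the edges
of the example are a guess at the sequential transducer of Example 1.5.13 that
is consistent with the equations Y = MX + N given there (Figure 1.36 itself is
not reproduced here).

```python
import os

TOP = None   # the element 0 of K: neutral for the sum, absorbing for the product


def plus(x, y):
    """Sum in K, that is, longest common prefix."""
    if x is TOP:
        return y
    if y is TOP:
        return x
    return os.path.commonprefix([x, y])


def times(v, x):
    return TOP if x is TOP else v + x


def lcp_array(states, next_table, terminal):
    """Iterate X <- MX + N from the vector of zeros until it is stable."""
    M = {}
    for (p, _a), (v, q) in next_table.items():
        M[p, q] = plus(M.get((p, q), TOP), v)
    P = {p: TOP for p in states}
    while True:
        new = {}
        for p in states:
            acc = terminal.get(p, TOP)                        # N_p
            for q in states:
                if (p, q) in M:
                    acc = plus(acc, times(M[p, q], P[q]))     # M_{p,q} X_q
            new[p] = acc
        if new == P:
            return P
        P = new


def normalize(initial, next_table, terminal, states):
    prefix, start = initial
    P = lcp_array(states, next_table, terminal)
    new_next = {(p, a): ((v + P[q])[len(P[p]):], q)
                for (p, a), (v, q) in next_table.items()}
    new_terminal = {p: t[len(P[p]):] for p, t in terminal.items()}
    return (prefix + P[start], start), new_next, new_terminal


# Assumed sequential transducer for the Fibonacci morphism a -> ab, b -> a.
states = [0, 1]
next_table = {(0, "a"): ("a", 1), (0, "b"): ("a", 0),
              (1, "a"): ("ba", 1), (1, "b"): ("ba", 0)}
terminal = {0: "", 1: "b"}
print(lcp_array(states, next_table, terminal))  # {0: '', 1: 'b'}
```

After normalization both states carry the edges a|ab and b|a with empty
terminal values, which is why they are merged by the minimization step.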


The last step of the minimization algorithm consists in applying the
minimization algorithm to the input automaton, starting from the initial partition
which is defined by p ≡ q if T(p) = T(q) and if p ∗ a = q ∗ a for each a ∈ A. Any
one of the minimization algorithms presented in Section 1.3.4 applies.


EXAMPLE 1.5.14. We apply the minimization algorithm to the transducer
obtained after normalization on the right of Figure 1.37. The two states are found
to be equivalent. The result is the sequential transducer on the right of
Figure 1.37 which is of course identical to the transducer on the left of
Figure 1.29.

Figure 1.37. The minimization algorithm.


1.6. Parsing


There are other ways, beyond regular expressions, to specify properties of words.
In particular, context-free grammars offer a popular way to describe words
satisfying constraints. These constraints often appear as the syntactic constraints
defining programming languages or also natural languages. The patterns
specified by regular expressions can also be expressed in this way, but grammars
are strictly more powerful.

The problem of parsing or syntax analysis consists in computing the
derivation tree of a word, given a grammar.

A grammar G on an alphabet A is given by a finite set V and a finite set
R ⊂ V × (A ∪ V)*. The elements of V are called variables and the elements of R
are called the productions of the grammar. A production (v, w) is often written
v → w. One fixes moreover a particular variable i ∈ V called the axiom. The
grammar is denoted by G = (A, V, R, i).

Given two words x, y ∈ (A ∪ V)*, one writes x → y if y is obtained from x
by replacing some occurrence of v by w for some production (v, w) in R, i.e.
if x = pvq, y = pwq. One denotes by →* the reflexive and transitive closure of
the relation →. Thus x →* y if there exists a sequence w_0 = x, w_1, ..., w_n = y
of words w_h ∈ (A ∪ V)* such that w_h → w_{h+1} for 0 ≤ h < n. Such a sequence is
called a derivation from x to y. The language generated by the grammar G is
the set

    L(G) = {x ∈ A* | i →* x}.

One may more generally consider the language generated by any variable v,
denoted by L(G, v) = {x ∈ A* | v →* x}.

A grammar G = (A, V, R, i) can usefully be viewed as a system of equations,
where the unknowns are the variables. Consider indeed the system of equations

    v = W_v    (v ∈ V)                                        (1.6.1)

where W_v = {w | (v, w) ∈ R}. If each variable v is replaced by the set L(G, v),
one obtains a solution of the system of equations which is always the smallest
solution (with respect to set inclusion) of the system.

A variant of the definition of a grammar is often used, where the sets W_v
of Equation 1.6.1 are regular sets. In this case, these sets are usually described
by regular expressions. This is equivalent to the first definition but often more
compact. We give two fundamental examples of languages generated by a
grammar.


EXAMPLE 1.6.1. As a first example, let A = {a, b}, V = {v} and R be
composed of the two productions

    v → avv,    v → b.

The language generated by the grammar G = (A, V, R, v) is known as the
Łukasiewicz language. Its elements can be interpreted as arithmetic expressions
in prefix notation, with a as an operator symbol and b as an operand symbol.
The first words of L(G) in radix order are b, abb, aabbb, ababb, aaabbbb, aababbb,
aabbabb, abaabbb, .... In alphabetic order (with a < b) the last words are
..., abb, b.

EXAMPLE 1.6.2. The second fundamental example is the Dyck language
generated by the grammar G with the same sets A, V as above and the productions

    v → avbv,    v → ε.


Let M be the language generated by this grammar. Then M = aMbM + ε.
Set D = aMb. Then M = DM + ε. This shows that M = D*, and thus
D = aD*b. The set M is called the Dyck language, and D is the set of Dyck
primes. The words in M can be viewed as well-formed sequences of parentheses
with a as left parenthesis and b as right parenthesis. The words of D are the
words in M which are not products of two nonempty words of M. The first
words in radix order in D and in D* are respectively ab, aabb, aaabbb, aababb, ...,
and ε, ab, aabb, abab, aaabbb, .... A basic relation between the Łukasiewicz set Ł
and the Dyck language M is the equation

    Ł = Mb.

This is easy to verify, provided one uses the equational form of the grammar.
The set Ł is indeed uniquely defined as the solution of the equation

    Ł = aŁŁ + b                                               (1.6.2)

Since M = aMbM + ε, multiplying both sides by b on the right, we obtain

    Mb = aMbMb + b,

which is Equation 1.6.2, whence Mb = Ł. There is a simple combinatorial
interpretation of this identity. Let δ(x) denote the difference of the number of
occurrences of a and of b in the word x. One can verify that a word x is in M
if and only if δ(x) = 0 and δ(p) ≥ 0 for each prefix p of x. Similarly, a word x
is in Ł if and only if δ(x) = −1 and δ(p) ≥ 0 for each proper prefix p of x.
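
The characterization by δ lends itself to a quick experiment. The Python sketch
below is only an illustration (the function names are ours): it tests
membership in M and in Ł through the values of δ on prefixes and checks the
identity Ł = Mb on all words of bounded length.

```python
from itertools import product


def delta_profile(word):
    """Values of delta on the nonempty prefixes of word."""
    total, profile = 0, []
    for letter in word:
        total += 1 if letter == "a" else -1
        profile.append(total)
    return profile


def is_dyck(word):
    profile = delta_profile(word)
    return all(d >= 0 for d in profile) and (not profile or profile[-1] == 0)


def is_lukasiewicz(word):
    profile = delta_profile(word)
    return bool(profile) and profile[-1] == -1 and all(d >= 0 for d in profile[:-1])


def words(max_len):
    """All words over {a, b} of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)


dyck = {w for w in words(6) if is_dyck(w)}
luka = {w for w in words(7) if is_lukasiewicz(w)}
print(luka == {w + "b" for w in dyck})                 # True
print(sorted(luka, key=lambda w: (len(w), w))[:4])     # ['b', 'abb', 'aabbb', 'ababb']
```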


A derivation tree for a word w is a tree T labeled by elements of A ∪ V ∪ {ε}
such that

1. The root of T is labeled by i.

2. For each interior node n, the pair (v, x) formed by the label v of n and the
word x obtained by concatenating the labels of the children of n in left to
right order is an element of R.

3. A leaf is labeled ε only if it is the unique child of its parent.

4. The word w is obtained by concatenating the labels of the leaves of T in
increasing order.


A derivation tree is a useful shorthand for representing a set of derivations.
Indeed, any traversal of the derivation tree produces a derivation represented
by this tree, and conversely.

Figure 1.38. A derivation tree for the word abaabb in the Dyck grammar.


We now present in an informal manner two strategies for syntax analysis.
Given a grammar G and a word x, we want to be able to check whether x is
in L(G). This amounts to building a derivation i →* x from the axiom i of G to
x. There are two main options for doing this. The first one, called top-down
parsing, builds the derivation from left to right (from i to x). This corresponds
to constructing the derivation tree from the root to the leaves. The second
one, called bottom-up parsing, builds the derivation from right to left (from x
backwards to i). This corresponds to constructing the derivation tree from the
leaves to the root.


1.6.1. Top-down parsing 


The idea of top-down parsing is to build the derivation tree from the root. This
is done by trying to build a derivation i →* x from left to right. The current
situation in a top-down parsing is as follows (see Figure 1.39). A derivation
i →* yw has already been constructed. It has produced the prefix y of x = yz.
It remains to build the derivation w →* z. We may assume that w starts with
a variable v, that is w = vs. The key point for top-down parsing to work is
that the grammar fulfills the following requirement. The pair (v, a), where a is
the first letter of z, uniquely determines the production v → α to be used, i.e.
such that there exists a derivation αs →* z. Grammars having this property for
all x are usually called LL(1) grammars.

We illustrate this method on two examples. The first one is the example of
arithmetic expressions, and the second one concerns regular expressions already
considered in Section 1.4. We consider the following grammar defining
arithmetic expressions with operators + and * and parentheses. The grammar allows
unambiguous parsing of these expressions by introducing a hierarchy
(expressions > terms > factors) reflecting the usual precedence of arithmetic
operators (* > +).

    E → E + T | T
    T → T * F | F                                             (1.6.3)
    F → (E) | c

where c is any simple character.

Figure 1.39. Top down parsing.

We want to write a program to evaluate such an expression using top-down
parsing. The idea is to associate to each variable of the grammar a function
which acts according to the right side of the corresponding production in the
grammar. To manage the word to be analyzed, a function CURRENT() gives the
first letter of the suffix of the input word that remains to be analyzed. In syntax
analysis, the value of the function CURRENT() is called the lookahead symbol.

A function ADVANCE() moves forward by one symbol on the input word. The
value of CURRENT() allows one to choose the production of the grammar that
should be used.

As already said, this method will work provided one may uniquely select,
with the help of the value of CURRENT(), which production should be applied.
However, we are already faced with this problem with the productions
E → E + T and E → T, because the first letter of the input word does not tell
whether there is a + sign following the first term. This phenomenon is
called left recursion. To eliminate this feature, we transform the grammar and
replace the two rules above by the equivalent form E = T(+T)*. This shows
that every expression starts with a term, and the continuation of the derivation
is postponed to the end of the analysis of the first term.

The function corresponding to the variable E is given in Algorithm EVALEXP.
It returns the numerical value of the expression.


EVALEXP()
1  v ← EVALTERM()
2  while CURRENT() = ‘+’ do
3      ADVANCE()
4      v ← v + EVALTERM()
5  return v


The functions EVALTERM() and EVALFACT() corresponding to T and F are
similar.

EVALTERM()
1  v ← EVALFACT()
2  while CURRENT() = ‘*’ do
3      ADVANCE()
4      v ← v * EVALFACT()
5  return v

EVALFACT()
1  if CURRENT() = ‘(’ then
2      ADVANCE()
3      v ← EVALEXP()
4      ADVANCE()
5  else v ← CURRENT()
6      ADVANCE()
7  return v


The instruction at line 5 of the function EVALFACT() assigns to v the nu- 
merical value corresponding to the current symbol. 

The evaluation of an expression, involving the parsing of its structure, is
realized by calling EVALEXP().
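
The three functions translate almost line by line into Python. The sketch below
is an illustration under a few assumptions that are not in the pseudocode: the
simple characters c are decimal digits, the end of the input is marked by the
symbol '$', and errors are reported by raising SyntaxError.

```python
class Evaluator:
    """Top-down evaluator for E = T(+T)*, T = F(*F)*, F = (E) | c."""

    def __init__(self, text):
        self.text = text + "$"      # '$' is an assumed end marker
        self.pos = 0

    def current(self):
        return self.text[self.pos]  # the lookahead symbol

    def advance(self):
        self.pos += 1

    def eval_exp(self):
        v = self.eval_term()
        while self.current() == "+":
            self.advance()
            v += self.eval_term()
        return v

    def eval_term(self):
        v = self.eval_fact()
        while self.current() == "*":
            self.advance()
            v *= self.eval_fact()
        return v

    def eval_fact(self):
        if self.current() == "(":
            self.advance()
            v = self.eval_exp()
            if self.current() != ")":
                raise SyntaxError("expected ')'")
            self.advance()
        elif self.current().isdigit():
            v = int(self.current())
            self.advance()
        else:
            raise SyntaxError(f"unexpected symbol {self.current()!r}")
        return v


def evaluate(text):
    parser = Evaluator(text)
    value = parser.eval_exp()
    if parser.current() != "$":
        raise SyntaxError("trailing input")
    return value


print(evaluate("(1+2)*3"))  # 9
print(evaluate("1+2*3"))    # 7
```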

As a second example, we show that the syntax of regular expressions can also
be defined by a grammar. This is quite similar to the previously seen grammar
of arithmetic expressions.

    E → E + T | T
    T → T F | F                                               (1.6.4)
    F → G | G*
    G → (E) | c

The symbol c stands for a letter or the symbol representing the empty word. A
top-down parser for this grammar allows one to implement the constructions of
the previous section that produce a finite automaton from a regular expression.

We have just seen top-down parsing developed on two examples. These
examples show how easy it is to write a top-down analyzer. The drawback of
this method is that it assumes that the grammar defining the language has a
rather restricted form. In particular, it should not be left recursive, although
there exist standard procedures to eliminate left recursion. However, there
exist grammars that cannot be transformed into equivalent LL(1) grammars
that allow top-down parsing. The letters L in the acronym LL(1) refer to left
to right processing (on both the text and the derivation), and the number 1
refers to the number of lookahead symbols.

The precise definition of LL(1) grammars uses two functions called FIRST()
and FOLLOW() that associate to each variable a set of terminal symbols. For a
variable x ∈ V, FIRST(x) is the set of terminal symbols a ∈ A such that there
is a derivation of the form x →* au. The function FIRST() is extended to words
in a natural way: FIRST(w) is the set of terminal symbols a such that w →* au.

For each variable x ∈ V, FOLLOW(x) is the set of terminal symbols a ∈ A
such that there is a derivation u →* vxaw with a “following” x.

To compute FIRST(), we build a graph with A ∪ V as vertices and edges the
pairs (x, y) ∈ V × (A ∪ V) such that there is a production of the form x → uyw
with u →* ε. Then a ∈ FIRST(x) iff there is a path from x to a in this graph.
The graph corresponding to the grammar of arithmetic expressions is shown on
Figure 1.40(a).


Figure 1.40. The graphs of FIRST() (a) and of FOLLOW() (b).


The algorithm used to compute FIRST() is given more precisely below. We
begin with an algorithm (EPSILON()) which computes a boolean array epsilon
indicating whether a symbol v is nullable, i.e. whether v →* ε. The array epsilon
has size n + k, where n is the number of variables in the grammar and k is
the number of terminals.


EPSILON()
1  for each production v → ε do
2      epsilon[v] ← true
3  for i ← 0 to n − 1 do
4      for each production v → x_1 ··· x_m do
5          epsilon[v] ← epsilon[v] ∨ (epsilon[x_1] ∧ ··· ∧ epsilon[x_m])
6  return epsilon
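
The same computation can be sketched in Python. This is an illustration, not a
transcription: a set of nullable variables replaces the boolean array, and the
loop runs until nothing changes instead of a fixed number n of rounds. The
variables s and t added to the Dyck grammar are hypothetical and serve only to
show a non-nullable case.

```python
def nullable_variables(productions):
    """productions is a list of pairs (variable, tuple of symbols)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for variable, right_side in productions:
            # An empty right side makes all(...) true: the production v -> epsilon.
            if variable not in nullable and all(x in nullable for x in right_side):
                nullable.add(variable)
                changed = True
    return nullable


productions = [
    ("v", ("a", "v", "b", "v")),   # Dyck grammar of Example 1.6.2
    ("v", ()),
    ("s", ("v", "v")),             # hypothetical extra variables
    ("t", ("a", "v")),
]
print(sorted(nullable_variables(productions)))  # ['s', 'v']
```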


It is easy to compute a function ISNULLABLE(w) for w = x_1 ··· x_m as the
conjunction of the boolean values epsilon[x_i]. The computation of FIRST()
consists in several steps. We first compute the graph defined above. The graph
is represented by the set FIRSTCHILD(v) of successors of each variable v. The
function FIRST() is computed after a depth-first exploration of the graph has
been performed.


FIRSTCHILD(v)
1  ▷ S is the set of successors of v
2  S ← ∅
3  for each production v → x_1 ··· x_m do
4      for i ← 1 to m do
5          S ← S ∪ x_i
6          if epsilon[x_i] = false then
7              break
8  return S

We mark vertices in the graph by a standard depth-first exploration.

EXPLOREFIRSTCHILD(v)
1  firstmark[v] ← true
2  for each x ∈ FIRSTCHILD(v) do
3      if firstmark[x] = false then
4          EXPLOREFIRSTCHILD(x)


The array firstmark is used for exploration of the graph of FIRST(). Finally,
we compute FIRST().

FIRST(v)
1  ▷ firstmark is an array initialized to false
2  EXPLOREFIRSTCHILD(v)
3  S ← ∅
4  for each terminal c do
5      if firstmark[c] then
6          S ← S ∪ c
7  return S

The values of the function FIRST() could of course be stored in an array
first. The extension of FIRST to words is straightforward.

FIRST(w)
1  S ← ∅
2  for i ← 1 to n do          ▷ w has length n
3      S ← S ∪ FIRST(w[i])
4      if epsilon[w[i]] = false then
5          break
6  return S


There is an alternative way to present the computation of FIRST(), by means
of a system of mutually recursive equations. For this, observe that for each
variable x, FIRST(x) is the union of the sets FIRST(y) over the set S(x) of
successors of x in the graph of FIRST(). Thus, the function FIRST() is the least
solution of the system of equations

    FIRST(x) = ∪_{y ∈ S(x)} FIRST(y)    (x ∈ V)

such that FIRST(a) = {a} for each letter a ∈ A. For example, the equations for
the grammar 1.6.3 are

    FIRST(E) = FIRST(E) ∪ FIRST(T)
    FIRST(T) = FIRST(T) ∪ FIRST(F)
    FIRST(F) = {(, c}
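
The least solution of such a system can be obtained by iterating the equations
from empty sets until nothing changes. The Python sketch below illustrates this
on grammar 1.6.3; the encoding of productions as pairs and the joint
computation of the nullable variables are choices made for this sketch, which
replaces the depth-first exploration by a plain fixpoint iteration.

```python
def first_sets(productions, terminals):
    """Return (first, nullable) for a grammar given as (variable, right side) pairs."""
    first = {a: {a} for a in terminals}          # FIRST(a) = {a} for each letter
    for variable, _ in productions:
        first.setdefault(variable, set())
    nullable = set()
    changed = True
    while changed:
        changed = False
        for variable, right_side in productions:
            before = (len(first[variable]), variable in nullable)
            for x in right_side:
                first[variable] |= first[x]
                if x not in nullable:
                    break                        # later symbols cannot come first
            else:
                nullable.add(variable)           # every symbol of the right side is nullable
            if before != (len(first[variable]), variable in nullable):
                changed = True
    return first, nullable


GRAMMAR_163 = [
    ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("c",)),
]
first, nullable = first_sets(GRAMMAR_163, "+*()c")
print(sorted(first["E"]), sorted(first["T"]), sorted(first["F"]))
# ['(', 'c'] ['(', 'c'] ['(', 'c']
```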


To compute the function FOLLOW(), we build a similar graph. There are
two rules to define the edges.

1. If there is a production x → uvw with a terminal symbol a in FIRST(w),
then (v, a) is an edge.

2. If there is a production z → uxw with w →* ε, then (x, z) is an edge (notice
that we use the productions backwards).

The graph of FOLLOW() for the grammar of arithmetic expressions is shown
on Figure 1.40(b). The computation of the function FOLLOW is analogous. It
begins with the computation of the graph SIBLING(x).


SIBLING(x)
1  S ← ∅
2  for each production z → uxw do
3      S ← S ∪ FIRST(w)
4      if ISNULLABLE(w) then
5          S ← S ∪ z
6  return S


The depth-first exploration EXPLORESIBLING(v) is then performed as
before. It produces an array followmark which is used to compute the function
FOLLOW().

FOLLOW(v)
1  ▷ followmark is an array initialized to false
2  EXPLORESIBLING(v)
3  S ← ∅
4  for each terminal c do
5      if followmark[c] then
6          S ← S ∪ c
7  return S


As for the function FIRST(), the function FOLLOW() can also be computed
by solving a system of equations. The precise definition of an LL(1) grammar
can now be formulated. It is a grammar such that

1. For each pair of distinct productions x → u, x → v, with the same left side
and u, v ≠ ε, one has

    FIRST(u) ∩ FIRST(v) = ∅.

2. For each pair of distinct productions of the form x → u, x → ε, one has

    FIRST(u) ∩ FOLLOW(x) = ∅.


Observe that our grammar for arithmetic expressions violates the first
condition, since for instance FIRST(E) = FIRST(T), although we have two
productions E → E + T and E → T with the same left hand side. We have already
met this problem of left recursion, and solved it by transforming the grammar.
The solution that we described is actually equivalent to considering the grammar

    E  → T E′
    E′ → + T E′ | ε
    T  → F T′                                                 (1.6.5)
    T′ → * F T′ | ε
    F  → (E) | c

This grammar is equivalent to grammar 1.6.3. It meets the two conditions
for being LL(1). Indeed, the functions FIRST() and FOLLOW() are given in
Figure 1.41.


Figure 1.41. The graphs of FIRST() (a) and FOLLOW() (b) for the grammar 1.6.5.


For example, consider the productions E′ → ε and E′ → +TE′. The symbol
+ is not in FOLLOW(E′), and thus the second condition is satisfied for this pair
of productions. The characterization allows us to fill the entries of a table called
the parsing table given in Table 1.1. This is an equivalent way to define the
mutually recursive functions we defined above (for the wise: this is also a way
to convince oneself that the programs are correct!)


Table 1.1. The parsing table of grammar 1.6.5.


The computation of the LL(1) parsing table uses the following algorithm.

LLTABLE()
1  ▷ computes the LL(1) parsing table M
2  for each production p : v → w do
3      for each terminal c ∈ FIRST(w) do
4          M[v][c] ← p
5      if w = ε then
6          for each terminal c ∈ FOLLOW(v) do
7              M[v][c] ← p
8  return M

The above algorithm as written supposes the grammar to be LL(1). Error
messages to inform that the grammar is not LL(1) can easily be added.
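
As an illustration, the table of grammar 1.6.5 can be computed by the following
Python sketch. Several points are assumptions of the sketch rather than part of
the algorithm above: FIRST() and FOLLOW() are obtained by a single fixpoint
iteration instead of graph explorations, an end marker '$' is put in FOLLOW of
the axiom, the test w = ε is replaced by "w is nullable", and a conflict raises
an error, which is one way to add the error messages mentioned above.

```python
END = "$"   # assumed end-of-text marker, placed in FOLLOW of the axiom

GRAMMAR = [                          # grammar 1.6.5; () is the empty right side
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("c",)),
]
VARIABLES = {v for v, _ in GRAMMAR}


def first_of_word(word, first, nullable):
    """FIRST of a word, and whether the whole word is nullable."""
    result = set()
    for x in word:
        result |= first.get(x, {x})          # a terminal is its own FIRST set
        if x not in nullable:
            return result, False
    return result, True


def analyse(grammar, axiom):
    nullable = set()
    first = {v: set() for v in VARIABLES}
    follow = {v: set() for v in VARIABLES}
    follow[axiom].add(END)

    def size():
        return (len(nullable), sum(map(len, first.values())),
                sum(map(len, follow.values())))

    while True:                              # iterate up to the least fixpoint
        before = size()
        for v, rhs in grammar:
            f, eps = first_of_word(rhs, first, nullable)
            first[v] |= f
            if eps:
                nullable.add(v)
            for k, x in enumerate(rhs):
                if x in VARIABLES:
                    f, eps = first_of_word(rhs[k + 1:], first, nullable)
                    follow[x] |= f
                    if eps:
                        follow[x] |= follow[v]
        if size() == before:
            return nullable, first, follow


def ll_table(grammar, axiom):
    nullable, first, follow = analyse(grammar, axiom)
    table = {}
    for v, rhs in grammar:
        f, eps = first_of_word(rhs, first, nullable)
        for c in f | (follow[v] if eps else set()):
            if (v, c) in table:
                raise ValueError(f"not LL(1): conflict on ({v}, {c})")
            table[v, c] = rhs
    return table


table = ll_table(GRAMMAR, "E")
print(table["E'", ")"], table["T'", "*"])    # () ('*', 'F', "T'")
```

An entry table[v, c] holds the right side of the production to apply when the
variable v has to be expanded and c is the lookahead symbol; a missing entry
corresponds to a syntax error.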


1.6.2. Bottom-up parsing 


We now describe bottom-up parsing, which is a more complicated but more
powerful method for syntax analysis than top-down parsing.

The idea of bottom-up parsing is to build the derivation tree from the leaves
to the root.


Figure 1.42. Bottom up parsing.


The current situation of bottom-up parsing is pictured in Figure 1.42. The
left part of the text which has already been analyzed has been reduced, using
the productions backwards, to a string that is kept in a stack. We will see below
that this actually corresponds to a last-in first-out strategy.

We present bottom-up parsing on the example of arithmetic expressions
already used above.

    1 : E → E + T
    2 : E → T
    3 : T → T * F                                             (1.6.6)
    4 : T → F
    5 : F → (E)
    6 : F → c

We reproduce the grammar 1.6.3 with productions numbered from 1 to 6.

Programming a bottom-up analyzer involves the management of a stack
containing the part of the text that has already been analyzed. The evolution
of the stack and of the text is pictured below (Figure 1.43) to be read from top
to bottom.


Figure 1.43. Evolution of the stack and of the text during the bottom-up
analysis of the expression (1 + 2) * 3.


At the beginning, the stack is empty. Each step consists either in 


1. transferring a new symbol from the text to the stack (this operation is 
called a shift); 


2. reducing the top part of the stack according to a rule of the grammar (this 
is a reduction). 


As an example, the second and third row in Figure 1.43 are the results of shifts, 
while the three following rows are the results of reductions by rules number 6, 
4, and 2 respectively. 

To be able to choose between shift and reduction, one uses a finite automaton
called the LR automaton. This automaton keeps track of the information
concerning the presence of the right side of a rule at the top of the stack. In our
example, the automaton is given in Figure 1.44.


Figure 1.44. The LR automaton.


The input to the LR automaton is the content of the stack. According to the
state reached, and to the lookahead symbol, the decision can be made whether
to shift or to reduce, and in the latter case by which rule. The fact that this
is possible is a property of the grammar. These grammars are called
SLR grammars.

In practice, instead of pushing the symbols on the stack, one rather pushes
the states of the LR automaton. The result on the expression (1 + 2) * 3 is shown
on Figure 1.45.

The decision made at each step uses two arrays S and R, represented on
Figure 1.46.

The array S is the transition table of the LR automaton. Thus S[p][c] is the
state reached from state p by reading c. The table R indicates which reduction
to perform. The value R[p][c] indicates the number of the production to be
used backwards to perform a reduction when the state p is on top of the stack
and the symbol c is the lookahead symbol. Empty entries in tables S and R
correspond to non existing transitions. A special state Accept, abbreviated as
Acc, is the accepting state ending the computation with a successful analysis.
The tables S and R could be superposed because their nonempty entries are
disjoint. Actually, this is necessary for the LR-algorithm to work!


Figure 1.45. The stack of states of the LR automaton during the bottom-up
analysis of the expression (1 + 2) * 3.


          c    (    +    *    )    $    E    T    F
     0    5    4                        1    2    3
     1              6              Acc
     2                   7
     3
     4    5    4                        8    2    3
     5
     6    5    4                             9    3
     7    5    4                                  10
     8              6         11
     9                   7
    10
    11

    (a) The array S.

          +    *    )    $
     0
     1
     2    2         2    2
     3    4    4    4    4
     4
     5    6    6    6    6
     6
     7
     8
     9    1         1    1
    10    3    3    3    3
    11    5    5    5    5

    (b) The array R.


Figure 1.46. The arrays S and R.

The implementation of the algorithm is given in the function LRPARSE(). It
uses, on the input, the two functions CURRENT() and ADVANCE() already
described earlier, and the symbol ‘$’ to mark the end of the text. The
functions TOP() and PUSH() are the usual functions on stacks. The function
REDUCE() operates in three steps. The call REDUCE(n), where n is the index of
the production r → u, consists in the following:


1. it erases from the stack the number of states equal to the length of u,
2. it computes the new value p = TOP() and the state q = S[p][r],
3. it pushes q on the stack.


In the implementation, the value −1 represents non existing transitions. The
function returns the boolean value true if the analysis was successful, and
false otherwise. There are three cases of failure:


1. there is neither a legal shift nor a legal reduction; this is checked at
lines 5 and 9. This happens for instance if the input is ‘)’.


2. the text has not been exhausted at the end of the analysis, for instance if
x = ‘(’; this leads to the same situation as above, because the state Accept
can only be accessed by the end marker.


3. the text has been exhausted before the end of the analysis; in this case,
the end marker leads to an empty entry in the tables.


LRPARSE(x)
 1  while TOP() ≠ Accept do
 2      p ← TOP()
 3      c ← CURRENT()
 4      q ← S[p][c]
 5      if q ≠ −1 then
 6          PUSH(q)
 7          ADVANCE()
 8      else n ← R[p][c]
 9          if n ≠ −1 then
10              REDUCE(n)
11          else return false
12  return true

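
The function LRPARSE() can be sketched in Python as follows. This is only an illustration under stated assumptions: the tables are transcribed from Figure 1.46 as dictionaries (a missing key plays the role of −1), the six productions are assumed to be E → E + T, E → T, T → T * F, T → F, F → (E), F → c, every digit is treated as the terminal c, and the stack is initialized with state 0. The names `lr_parse`, `PRODUCTIONS` and `row` are hypothetical.

```python
ACCEPT = "Acc"

# Assumed productions 1..6, stored as (left side, length of right side).
PRODUCTIONS = {
    1: ("E", 3),  # E -> E + T
    2: ("E", 1),  # E -> T
    3: ("T", 3),  # T -> T * F
    4: ("T", 1),  # T -> F
    5: ("F", 3),  # F -> ( E )
    6: ("F", 1),  # F -> c
}

# Array S: transitions of the LR automaton (absent key = no transition).
S = {
    0: {"c": 5, "(": 4, "E": 1, "T": 2, "F": 3},
    1: {"+": 6, "$": ACCEPT},
    2: {"*": 7},
    3: {},
    4: {"c": 5, "(": 4, "E": 8, "T": 2, "F": 3},
    5: {},
    6: {"c": 5, "(": 4, "T": 9, "F": 3},
    7: {"c": 5, "(": 4, "F": 10},
    8: {"+": 6, ")": 11},
    9: {"*": 7},
    10: {},
    11: {},
}


def row(production, lookaheads):
    """One row of R: the same production for each listed lookahead."""
    return {c: production for c in lookaheads}


# Array R: reductions, indexed by state and lookahead symbol.
R = {
    2: row(2, "+)$"), 3: row(4, "+*)$"), 5: row(6, "+*)$"),
    9: row(1, "+)$"), 10: row(3, "+*)$"), 11: row(5, "+*)$"),
}


def lr_parse(text):
    """Return True if text is accepted, following LRPARSE step by step."""
    tokens = ["c" if ch.isdigit() else ch for ch in text if not ch.isspace()]
    tokens.append("$")                    # end marker
    pos = 0
    stack = [0]                           # stack of states, initial state 0
    while stack[-1] != ACCEPT:
        p = stack[-1]
        c = tokens[pos]
        q = S[p].get(c, -1)
        if q != -1:                       # shift
            stack.append(q)
            pos += 1
        else:
            n = R.get(p, {}).get(c, -1)
            if n != -1:                   # REDUCE(n)
                left, length = PRODUCTIONS[n]
                del stack[len(stack) - length:]
                stack.append(S[stack[-1]][left])
            else:
                return False
    return True
```

Under these assumptions, `lr_parse("(1 + 2) * 3")` accepts, while `lr_parse(")")` and `lr_parse("(")` fail because no shift and no reduction is available.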
There remains to explain how to compute the LR automaton and the
corresponding tables from the grammar. We work with an end marker ‘$’.
Accordingly, we add to the grammar an additional rule which, in our running
example, is E′ → E$. The LR automaton recognizes the content of the stack and
its state allows one to tell whether the right side of some production is
present on the top of the stack. The set of possible stack contents (sometimes
called the viable prefixes) is the set


$$X = \{p_1 p_2 \cdots p_n \mid p_i \in P,\ n \ge 0\}$$


where P is the set of prefixes of the right sides of the productions and where,
for each $i$, $1 \le i \le n-1$, there is a production $(x_i, v_i)$ such that
$p_i x_{i+1}$ is a prefix of $v_i$, and $x_1$ is the axiom of the grammar. One
may verify this description of X by working on the bottom-up analysis
backwards. It is easy to build a non deterministic automaton recognizing the
set X above. It is built from the automata recognizing the right sides of the
productions and adding ε-transitions from each position before a variable y to
the initial positions of the productions with left side y.

The result is represented on Figure 1.47. The circled states correspond to 
full right hand sides and thus to productions of the grammar. 

To be complete, we should add the transitions corresponding to the rule
E′ → E$. The states 0 and 4, which correspond to the productions with left
side E, serve as initial states. The automaton of Figure 1.44 is just the
result of the determinization algorithm applied to the non deterministic
automaton obtained. This explains how the LR automaton and thus the table S,
which is just its transition table, are built. There still remains to explain
how table R is built. We have R[p][c] = n if and only if the reduction by
production n : x → v is possible in state p, and provided the lookahead symbol
c is in FOLLOW(x). This solves the conflicts between shift and reduce.

Suppose for example that the variable T is on top of the stack, as at lines
5, 14, 18 of Figure 1.43. At each of these lines, we can either reduce by
production 2 or shift. Similarly, at line 10 we can either reduce by
production 1 or 2, or shift. We should reduce only if the lookahead symbol is
in FOLLOW(E). This is why we choose to reduce by production 2 at lines 5 and
18. At line 14, we choose to shift, because the symbol * is not in FOLLOW(E).
At line 10, we reduce by production 1 because the corresponding state 9 allows
this reduction and the lookahead symbol ‘)’ is in FOLLOW(E).
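
The FOLLOW sets used in this discussion can be computed by a simple fixpoint iteration. The sketch below assumes the six productions E → E + T, E → T, T → T * F, T → F, F → (E), F → c, extended by E′ → E$, and relies on the fact that this grammar has no empty right side; the names `first_sets` and `follow_sets` are hypothetical.

```python
VARIABLES = {"E'", "E", "T", "F"}

# Assumed grammar, with the additional rule for the end marker.
GRAMMAR = [
    ("E'", ["E", "$"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "F"]),
    ("T", ["F"]),
    ("F", ["(", "E", ")"]),
    ("F", ["c"]),
]


def first_sets(grammar):
    """FIRST(x) for each variable x (valid because no right side is empty)."""
    first = {x: set() for x in VARIABLES}
    changed = True
    while changed:
        changed = False
        for left, right in grammar:
            head = right[0]
            new = first[head] if head in VARIABLES else {head}
            if not new <= first[left]:
                first[left] |= new
                changed = True
    return first


def follow_sets(grammar):
    """FOLLOW(x): terminals that may appear right after the variable x."""
    first = first_sets(grammar)
    follow = {x: set() for x in VARIABLES}
    changed = True
    while changed:
        changed = False
        for left, right in grammar:
            for i, symbol in enumerate(right):
                if symbol not in VARIABLES:
                    continue
                if i + 1 < len(right):
                    nxt = right[i + 1]
                    new = first[nxt] if nxt in VARIABLES else {nxt}
                else:
                    # symbol ends the right side: inherit FOLLOW(left)
                    new = follow[left]
                if not new <= follow[symbol]:
                    follow[symbol] |= new
                    changed = True
    return follow


FOLLOW = follow_sets(GRAMMAR)
```

With this grammar the symbol * belongs to FOLLOW(T) but not to FOLLOW(E), which is what forces the shift when T is on top of the stack and * is the lookahead symbol.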

A grammar such that the method used above to fill the table R works is
called SLR(1). A word on this terminology. The acronym LR refers to a left
to right analysis of the text and a rightmost derivation (corresponding to a
bottom-up analysis). A grammar is said to be LR(0) if no shift-reduce conflict
appears on the LR automaton. The 0 means that no lookahead is needed to
make the decisions. This is not the case of Grammar 1.6.6, as we have seen. The
acronym SLR means ‘simple LR’ and the integer 1 refers to the length of the
lookahead. Formally, a grammar is said to be SLR(1) if for any state p of the
LR automaton and each terminal symbol c, at most one of the two following
cases arises.


1. There is a transition from p by c in the automaton.


2. There is a possible reduction in state p by production n : x → v such that
c ∈ FOLLOW(x).


In practice, this condition is equivalent to the property that the sets of non
empty entries of the tables S and R are disjoint.


Figure 1.47. A non deterministic LR automaton

More complicated methods exist, either with lookahead 1 or with a larger
lookahead, although a lookahead of size larger than 1 is rarely used in practice.
With lookahead 1, the class of LR(1) grammars uses an automaton called the
LR(1) automaton to keep track of the pair (s,c) of the stack content s and the
lookahead symbol c to be expected at the next reduction. The main drawback
is that the number of states is much larger than with the LR(0) automaton.


1.7. Word enumeration


One often has to compute the number of words satisfying some property. This
can be done using finite automata or grammars as illustrated in the following
examples.


1.7.1. Two illustrative examples 


The first example illustrates the case of a property defined by a finite automaton. 
EXAMPLE 1.7.1. The number $u_n$ of words of length n on the binary alphabet
{a,b} which do not contain two consecutive a’s satisfies the recurrence formula
$u_{n+1} = u_n + u_{n-1}$. Indeed, a nonempty word of length n can either
terminate with a or b. In the first case, it has to terminate with ba unless it
is the word a. Since $u_0 = 1$ and $u_1 = 2$, the number $u_n$ is the Fibonacci
number $F_{n+2}$.

This argument can be used quite generally when the corresponding set of
words is recognized by a finite automaton. In the present case, the set S of
words without factor aa is recognized by the Golden mean automaton of
Figure 1.11. Let $S_q$ be the set of words recognized by the automaton with
initial state 1 and final state q. We derive from the automaton the following
set of equations


$$S_1 = S_1 b + S_2 b + \varepsilon, \qquad S_2 = S_1 a.$$


Since $S = S_1 + S_2$, summing up the equations gives

$$S = Sb + S_1 a + \varepsilon = S(b + ba) + a + \varepsilon,$$

where the last step uses $S_1 = Sb + \varepsilon$.
This gives the expected recurrence relation.
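
As a quick sanity check of Example 1.7.1, the following sketch compares a brute-force count of the words without two consecutive a’s with the recurrence $u_{n+1} = u_n + u_{n-1}$, $u_0 = 1$, $u_1 = 2$. The function names are hypothetical and the brute-force count is only practical for small n.

```python
from itertools import product


def count_bruteforce(n):
    """Count the words of length n over {a, b} that do not contain aa."""
    return sum(1 for w in product("ab", repeat=n) if "aa" not in "".join(w))


def count_recurrence(n):
    """Same count via u_{n+1} = u_n + u_{n-1}, with u_0 = 1 and u_1 = 2."""
    if n == 0:
        return 1
    u_prev, u = 1, 2
    for _ in range(n - 1):
        u_prev, u = u, u + u_prev
    return u


counts = [count_recurrence(n) for n in range(8)]
```

The list `counts` is a shifted Fibonacci sequence, in agreement with $u_n = F_{n+2}$.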


A second example concerns the Dyck language. 


EXAMPLE 1.7.2. Recall from Example 1.6.2 that the Dyck language $D^*$ is
related to the Łukasiewicz language Ł by the relation $D^* b = Ł$. Let $f_n$
be the number of words of length n in D and let $u_n$ be the number of words
of length n in $D^*$.

It can be verified, using the function δ of Example 1.6.2, that each word x
of length 2n + 1 with δ(x) = −1 is primitive and has exactly one conjugate in
Ł. Since $u_{2n}$ is also the number of words of length 2n + 1 in Ł, one gets


$$u_{2n} = \frac{1}{2n+1}\binom{2n+1}{n} = \frac{1}{n+1}\binom{2n}{n}.$$


Since $D = aD^*b$, it follows that

$$f_{2n} = \frac{1}{n}\binom{2n-2}{n-1}.$$

The sequence $(u_{2n})$ is the sequence of Catalan numbers.
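
The two closed formulas of Example 1.7.2 can be checked by enumeration for small lengths. The sketch below assumes that $D^*$ is the set of well-parenthesized words, with a as opening and b as closing letter, and that D consists of the words $aD^*b$; the helper names are hypothetical.

```python
from itertools import product
from math import comb


def in_dyck_star(w):
    """True if w is in D*: every prefix has at least as many a's as b's,
    and the whole word has as many a's as b's."""
    height = 0
    for letter in w:
        height += 1 if letter == "a" else -1
        if height < 0:
            return False
    return height == 0


def in_dyck_prime(w):
    """True if w is in D = a D* b."""
    return len(w) >= 2 and w[0] == "a" and w[-1] == "b" and in_dyck_star(w[1:-1])


def count(predicate, length):
    """Number of words of the given length over {a, b} satisfying predicate."""
    return sum(1 for w in product("ab", repeat=length) if predicate(w))


u = {n: count(in_dyck_star, 2 * n) for n in range(1, 7)}   # u_{2n}
f = {n: count(in_dyck_prime, 2 * n) for n in range(1, 7)}  # f_{2n}
```

For n = 1, ..., 6 the values of `u` are 1, 2, 5, 14, 42, 132, the Catalan numbers, and `f` is the same sequence shifted by one index.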
The combinatorial method used to compute the numbers $f_n$ and $u_n$ can be
frequently generalized in the case of more complicated grammars (see
Chapter 9). In the present case, the relation is the following.


We start with the relation $D = aD^*b$. This implies that the generating
function $D(z) = \sum_{n \ge 0} f_n z^n$ satisfies the equation


$$D^2 - D + z^2 = 0.$$

It follows that

$$D(z) = \frac{1 - \sqrt{1 - 4z^2}}{2}.$$

An elementary application of the binomial formula gives the expected
expression for the coefficient $f_n$.

1.7.2. The Perron–Frobenius theorem


Several enumeration problems on words involve the spectral properties of
nonnegative matrices. The Perron–Frobenius theorem describes some of these
properties and constitutes a very important tool in this framework. We shall
see in the next section several applications of this theorem.

Let Q be a set of indices (we have of course in mind the set of states of
a finite automaton). For two Q-vectors v, w with real coordinates, one writes
$v \le w$ if $v_q \le w_q$ for all $q \in Q$ and $v < w$ if $v_q < w_q$ for all
$q \in Q$. A vector v is said to be nonnegative (resp. positive) if $v \ge 0$
(resp. $v > 0$). In the same way, for two Q × Q-matrices M, N with real
coefficients, one writes $M \le N$ when $M_{p,q} \le N_{p,q}$ for all
$p, q \in Q$ and $M < N$ when $M_{p,q} < N_{p,q}$ for all $p, q \in Q$.
The Q × Q-matrix M is said to be nonnegative (resp. positive) if $M \ge 0$
(resp. $M > 0$). We shall often use the elementary fact that if $M > 0$ and
$v \ge 0$ with $v \ne 0$, then $Mv > 0$.

A nonnegative matrix M is said to be irreducible if for all indices p, q, there
is an integer k such that $M^k_{p,q} > 0$, where $M^k$ denotes the k-th power
of M. It is easy to verify that M is irreducible if and only if
$(I + M)^n > 0$ where n is the dimension of M. It is also easy to prove that M
is reducible (i.e. M is not irreducible) if there is a reordering of the
indices such that M is block triangular, i.e. of the form


$$M = \begin{pmatrix} U & V \\ 0 & W \end{pmatrix} \tag{1.7.1}$$


with U, W of dimension > 0.

A nonnegative matrix M is called primitive if there is an integer k such that
$M^k > 0$. A primitive matrix is irreducible but the converse is not true.

A nonnegative matrix M is called aperiodic if the greatest common divisor
of the integers k such that $M^k_{i,i} > 0$ for some i is equal to 1
(including the case where the set of integers k is empty). It can be verified
that a matrix is primitive if and only if it is aperiodic and irreducible.

The Perron–Frobenius Theorem asserts that for any nonnegative matrix M,
the following holds.

1. The matrix M has a real eigenvalue $\rho_M$ such that $|\lambda| \le \rho_M$
for any eigenvalue $\lambda$ of M.

2. If $M \le N$ with $M \ne N$, then $\rho_M < \rho_N$.

3. There corresponds to $\rho_M$ a nonnegative eigenvector v and $\rho_M$ is
the only eigenvalue with a nonnegative eigenvector.

4. If M is irreducible, the eigenvalue $\rho_M$ is simple and there
corresponds to $\rho_M$ a positive eigenvector v.

5. If M is primitive, all other eigenvalues have modulus strictly less than
$\rho_M$. Moreover, $\frac{1}{\rho^n} M^n$ converges to a matrix of the form
vw, where v (w) is a right (left) eigenvector corresponding to $\rho$, i.e.
$Mv = \rho v$ ($wM = \rho w$) and $wv = 1$.

We shall give a sketch of a proof of this classical theorem. Let us first show
that one may reduce to the case where M is irreducible. Indeed, if M is
reducible, we may consider a triangular decomposition as in Equation 1.7.1
above. Applying by induction the theorem to U and W, we obtain the result with
$\rho_M$ equal to the maximal value of the moduli of eigenvalues of U and W.
The corresponding eigenvector is completed with zeroes (and thus condition 4
fails to hold).

We suppose from now on that M is irreducible. For a nonnegative Q-vector
v, let

$$r_M(v) = \min\{(Mv)_i / v_i \mid 1 \le i \le n,\ v_i \ne 0\}.$$

Thus $r_M(v)$ is the largest real number r such that $Mv \ge rv$. The function
$r_M$ is known as the Wielandt function. One has $r_M(\lambda v) = r_M(v)$ for
all real numbers $\lambda > 0$. Moreover, $r_M$ is continuous on the set of
nonnegative vectors.

The set X of nonnegative vectors v such that $\|v\| = 1$ is compact. Since a
continuous function on a compact set reaches its maximum on this set, there
is an $x \in X$ such that $r_M(x) = \rho_M$ where
$\rho_M = \max\{r_M(w) \mid w \in X\}$. Since $r_M(v) = r_M(\lambda v)$ for
$\lambda > 0$, we have $\rho_M = \max\{r_M(w) \mid w \ge 0,\ w \ne 0\}$.

We show that $Mx = \rho_M x$. By the definition of the function $r_M$, we have
$Mx \ge \rho_M x$.

Set $y = Mx - \rho_M x$. Then $y \ge 0$. Assume $Mx \ne \rho_M x$. Then
$y \ne 0$. Since $(I + M)^n > 0$, this implies that the vector $(I + M)^n y$
is positive. But


$$(I+M)^n y = (I+M)^n (Mx - \rho_M x) = M(I+M)^n x - \rho_M (I+M)^n x = Mz - \rho_M z,$$


with $z = (I + M)^n x$. This shows that $Mz > \rho_M z$, which implies that
$r_M(z) > \rho_M$, a contradiction with the definition of $\rho_M$. This shows
that $\rho_M$ is an eigenvalue with a nonnegative eigenvector.

Let us show that $\rho_M \ge |\lambda|$ for each real or complex eigenvalue
$\lambda$ of M. Indeed, let v be an eigenvector corresponding to $\lambda$.
Then $Mv = \lambda v$. Let $|v|$ be the nonnegative vector with coordinates
$|v_i|$. Then $M|v| \ge |\lambda|\,|v|$ by the triangular inequality. By the
definition of the Wielandt function, this implies $r_M(|v|) \ge |\lambda|$
and consequently $\rho_M \ge |\lambda|$. This completes the proof of
assertion 1.

We have already seen that there corresponds to $\rho_M$ a nonnegative
eigenvector x. Let us now verify that $x > 0$. But this is easy since
$(I + M)^n x = (1 + \rho_M)^n x$, which implies that $(1 + \rho_M)^n x > 0$
and thus $x > 0$.

In order to prove assertion 2, let us consider N such that $M \le N$. Then
obviously $\rho_M \le \rho_N$. Let us show that $\rho_M = \rho_N$ implies
$M = N$. Let $v > 0$ be such that $Mv = \rho_M v$. Then $Nv \ge \rho_M v$ and
we conclude as above that $Nv = \rho_M v$. From $Mv = Nv$ with $v > 0$, we
conclude that $M = N$ as asserted.

We now complete the proof of assertion 3. Let $Mv = \lambda v$ with $v \ge 0$.
Since, as above, $(I + M)^n v = (1 + \lambda)^n v$, we have actually $v > 0$.
Let D be the diagonal matrix with coefficients $v_1, v_2, \ldots, v_n$ and let
$N = D^{-1} M D$. Since $N_{i,j} = M_{i,j} v_j / v_i$, we have
$\sum_j N_{i,j} = \lambda$ for $1 \le i \le n$. Let w be a nonnegative
eigenvector of N for the eigenvalue $\rho_M$. We normalize w in such a way that
$w_i \le 1$ for all i and $w_t = 1$ for one index t. Then,
$\rho_M = \sum_j N_{t,j} w_j \le \sum_j N_{t,j} = \lambda$. Thus
$\lambda = \rho_M$ as asserted. This completes the proof of assertion 3.

We further have to prove that $\rho_M$ is simple. Let
$p(\lambda) = \det(\lambda I - M)$ be the characteristic polynomial of M. We
have $p'(\lambda) = \sum_i \det(\lambda I - M_i)$ where $M_i$ is the matrix
obtained from M by replacing the i-th row and column by 0. Indeed,

$$\det(\lambda I - M) = \det(\lambda e_1 - v_1, \ldots, \lambda e_n - v_n)$$

where $e_i$ is the i-th unit vector and $v_i$ is the i-th column of M. Since
the determinant is a multilinear function, this gives the desired formula for
$p'(\lambda)$. One has $M_i \le M$ and $M_i \ne M$ for each i, because an
irreducible matrix cannot have a null row. By assertion 2,
$\rho_{M_i} < \rho_M$ and thus $\det(\rho_M I - M_i) > 0$, whence
$p'(\rho_M) > 0$. This shows that the root $\rho_M$ is simple.

Let us finally prove assertion 5. Let $\lambda$ be an eigenvalue of M such that
$|\lambda| = \rho_M$. Let v be an eigenvector for the eigenvalue $\lambda$.
Then, from $Mv = \lambda v$, we obtain $M|v| \ge \rho_M |v|$ whence
$M|v| = \rho_M |v|$ by the same argument as above. Let k be such that
$M^k > 0$. Then, from $|M^k v| = M^k |v|$, we deduce that v is collinear to a
nonnegative real vector. This shows that $\lambda$ is real and thus that
$\lambda = \rho_M$.

Since M has a simple eigenvalue $\rho_M$ strictly greater than the modulus of
every other eigenvalue, the sequence $\frac{1}{\rho_M^n} M^n$ converges to a
matrix of rank one, which is thus of the indicated form.

This completes the proof of the Perron–Frobenius theorem. For an indication
of another proof, see Problem 1.7.1.


The practical computation of the maximal eigenvalue of a primitive matrix
M can be done using the following algorithm. It is based on the fact that,
by Assertion 5, the sequence defined by
$x^{(n+1)} = \frac{1}{\|M x^{(n)}\|} M x^{(n)}$ converges to an eigenvector
corresponding to the maximal eigenvalue, and thus $r_M(x^{(n)})$ converges
to the maximal eigenvalue $\rho_M$. The starting value $x^{(0)}$ can be an
arbitrary positive vector.


DOMINANTEIGENVALUE(M, x)
 1  y ← x
 2  do  (y, x) ← (My, y)
 3      r ← min_{1≤i≤n} y_i / x_i
 4      y ← (1/r) y
 5  while y ≉ x
 6  return r


where y ≈ x means that y is numerically close to x.
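
A plain Python rendering of this iteration is sketched below. It is an illustration rather than a reference implementation: the stopping test uses a relative tolerance `tol` and an iteration cap `max_iter`, both of which are assumptions, and the names `mat_vec`, `wielandt` and `dominant_eigenvalue` are hypothetical. The starting vector is assumed to be positive and the matrix primitive.

```python
def mat_vec(M, x):
    """Product of the square matrix M (list of rows) with the vector x."""
    return [sum(m * t for m, t in zip(row, x)) for row in M]


def wielandt(M, x):
    """Wielandt function r_M(x) = min (Mx)_i / x_i over the i with x_i != 0."""
    y = mat_vec(M, x)
    return min(yi / xi for yi, xi in zip(y, x) if xi != 0)


def dominant_eigenvalue(M, x, tol=1e-12, max_iter=10_000):
    """Approximate the maximal eigenvalue of a primitive matrix M."""
    y = list(x)
    r = None
    for _ in range(max_iter):
        x = y                               # previous iterate
        y = mat_vec(M, x)                   # y = M x
        r = min(yi / xi for yi, xi in zip(y, x))
        y = [yi / r for yi in y]            # rescale by the current estimate
        if max(abs(yi - xi) for yi, xi in zip(y, x)) <= tol * max(x):
            break                           # y is numerically close to x
    return r


GOLDEN = [[1, 1], [1, 0]]
rho = dominant_eigenvalue(GOLDEN, [1.0, 1.0])
```

For the matrix `GOLDEN`, the value `rho` approximates $(1 + \sqrt{5})/2$. Each intermediate value of r is a lower bound of the maximal eigenvalue, since $r_M(x) \le \rho_M$ for every nonnegative vector x.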

The vector computed by this algorithm is called an approximate eigenvector.
The definition is the following. Let M be a nonnegative matrix. Let r be such
that $r \le \rho_M$. Then a vector v such that $Mv \ge rv$ is called an
approximate eigenvector relative to r.


EXAMPLE 1.7.3. Let

$$M = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}.$$

The matrix M is nonnegative and irreducible. The eigenvalues of M are
$\varphi = (1 + \sqrt{5})/2$ and $\hat{\varphi} = (1 - \sqrt{5})/2$. The vector
$x = \begin{pmatrix} \varphi \\ 1 \end{pmatrix}$ is an eigenvector relative to
$\varphi$. The vector $v = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$ is an
approximate eigenvector relative to r = 1 and Mv is an approximate eigenvector
relative to r = 3/2.


1.8. Probability distributions on words 


In this section, we consider the result of randomly selecting the letters composing 
a word. We begin with the formal definition of a probability law ruling this 
selection. 


1.8.1. Information sources 


Given an alphabet A, a probability distribution on the set of words on A is a
function $\pi : A^* \to [0,1]$ such that $\pi(\varepsilon) = 1$ and for each
word $x \in A^*$,


$$\sum_{a \in A} \pi(xa) = \pi(x).$$


The definition implies that $\sum_{x \in A^n} \pi(x) = 1$ for all $n \ge 0$.
Thus a probability distribution on words does not make the set of all words a
probability space but it does for each set $A^n$.

Probability distributions on words are sometimes defined with a different
vocabulary. One considers a sequence of random variables
$(X_1, X_2, \ldots, X_n, \ldots)$ with values in the set A. Such a sequence is
often called a discrete time information source or also a stochastic process.
For $x = a_1 \cdots a_n$ with $a_i \in A$, set

$$\pi(x) = P(X_1 = a_1, \ldots, X_n = a_n).$$

Then $\pi$ is a probability distribution in the previous sense. Conversely, if
$\pi$ is a probability distribution, this formula defines the n-th order joint
distribution of the sequence $(X_1, X_2, \ldots, X_n, \ldots)$. We will say
that P and $\pi$ correspond to each other.

Two particular cases are worth mentioning: Bernoulli distributions and 
Markov chains. 

First, a Bernoulli distribution corresponds to successively independent
choices of the symbols in a word, with a fixed distribution on letters. Thus it
is given by a probability distribution on the set A extended by simple
multiplication. For $a_1, a_2, \ldots, a_n \in A$, one has
$\pi(a_1 a_2 \cdots a_n) = \pi(a_1)\pi(a_2)\cdots\pi(a_n)$. In the
terminology of information sources, a Bernoulli distribution corresponds to a
sequence of independent, identically distributed (i.i.d.) random variables.

For example, if the alphabet has two letters a and b with probabilities
$\pi(a) = p$ and $\pi(b) = q = 1 - p$, then $\pi(w) = p^{|w|_a} q^{|w|_b}$.
The random variable X whose value is the number of a’s in a word w of length n
has the distribution


$$P(X = m) = \binom{n}{m} p^m q^{n-m}.$$

This distribution is called the binomial distribution. Its expectation and
variance are

$$E(X) = np, \qquad \mathrm{Var}(X) = npq.$$


Second, a Markov chain corresponds to the case where the probability of
choosing a symbol depends on the previous choice, but not on earlier choices.
Thus, a Markov chain is given by an initial distribution $\pi$ on A and by an
A × A stochastic matrix P of conditional probabilities P(a, b), i.e. such that
for all $a \in A$, $\sum_{b \in A} P(a,b) = 1$. Then


$$\pi(a_1 a_2 \cdots a_n) = \pi(a_1) P(a_1, a_2) \cdots P(a_{n-1}, a_n).$$


In terms of stochastic processes, P(a, b) is the conditional probability given
for all $n \ge 2$ by $P(a,b) = P(X_n = b \mid X_{n-1} = a)$. The powers of the
matrix P allow one to compute the probability $P(X_n = a)$. Indeed, since
$P(X_1 = a) = \pi(a)$, one has $P(X_n = a) = (\pi P^{n-1})(a)$.


EXAMPLE 1.8.1. Consider the Markov chain over A = {a,b} given by the matrix

$$P = \begin{pmatrix} 1/2 & 1/2 \\ 1 & 0 \end{pmatrix}$$

and the initial distribution $\pi(a) = \pi(b) = 1/2$. For example, one has
$\pi(aab) = \pi(aaba) = 1/8$. This distribution assigns probability 0 to any
word containing two consecutive b’s because P(b, b) = 0.


A distribution $\pi$ on $A^*$ is said to be stationary if for all $x \in A^*$,
one has $\pi(x) = \sum_{a \in A} \pi(ax)$. In terms of stochastic processes,
this means that the joint distribution does not depend on the choice of time
origin, that is,


$$P(X_i = a_i,\ m \le i \le n) = P(X_{i+1} = a_i,\ m \le i \le n).$$


A Bernoulli distribution is a stationary distribution. A Markov chain is
stationary if and only if $\pi P = \pi$, i.e. if $\pi$ is an eigenvector of the
matrix P for the eigenvalue 1. The distribution $\pi$ on A is itself called
stationary.

A Markov chain is irreducible if, for all $a, b \in A$, there exists an integer
n > 0 such that $P^n(a,b) > 0$. This is exactly the definition of an
irreducible matrix.

Similarly, a Markov chain is aperiodic if the matrix P is aperiodic.

The fundamental theorem of Markov chains says that for any irreducible
Markov chain, there is a unique stationary distribution $\pi$ and, if the chain
is moreover aperiodic, whatever the initial distribution, $P(X_n = a)$ tends to
$\pi(a)$. The proof uses the Perron–Frobenius theorem.


EXAMPLE 1.8.2. Consider the Markov chain over A = {a,b} given by the same
matrix

$$P = \begin{pmatrix} 1/2 & 1/2 \\ 1 & 0 \end{pmatrix}$$

and the initial distribution $\pi(a) = 2/3$, and $\pi(b) = 1/3$. This Markov
chain is irreducible, and $\pi$ is its unique stationary distribution.
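
The computations of Examples 1.8.1 and 1.8.2 can be reproduced with exact rational arithmetic. In the sketch below, the dictionary layout and the names `word_probability` and `step` are illustrative choices, not notation from the text.

```python
from fractions import Fraction as Fr

LETTERS = "ab"

# Transition matrix P of Examples 1.8.1 and 1.8.2, as P[a][b] = P(a, b).
P = {
    "a": {"a": Fr(1, 2), "b": Fr(1, 2)},
    "b": {"a": Fr(1), "b": Fr(0)},
}


def word_probability(word, initial):
    """pi(a_1 ... a_n) = pi(a_1) P(a_1, a_2) ... P(a_{n-1}, a_n)."""
    if not word:
        return Fr(1)
    prob = initial[word[0]]
    for x, y in zip(word, word[1:]):
        prob *= P[x][y]
    return prob


def step(dist):
    """One step of the chain: the row vector dist multiplied by P."""
    return {b: sum(dist[a] * P[a][b] for a in LETTERS) for b in LETTERS}


uniform = {"a": Fr(1, 2), "b": Fr(1, 2)}
stationary = {"a": Fr(2, 3), "b": Fr(1, 3)}

# Distribution of X_n when the chain is started from the uniform distribution.
dist = uniform
for _ in range(60):
    dist = step(dist)
```

After the loop, `dist` is numerically very close to `stationary`, which illustrates the convergence of $P(X_n = a)$ for this irreducible and aperiodic chain.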
A Markov chain is actually a particular case of a more general concept which
is a probability distribution on words given by a finite automaton. Let
$\mathcal{A} = (Q, A)$ be a finite deterministic automaton. Let $\pi$ be a
probability distribution on Q. For each state $q \in Q$, consider a probability
distribution on the set of edges starting in q. This is again denoted by
$\pi$. Thus


$$\sum_{q \in Q} \pi(q) = 1, \qquad \sum_{a \in A} \pi(q, a) = 1 \text{ for all } q \in Q.$$


This defines a probability distribution on the set of paths in $\mathcal{A}$:
given a path $\gamma : q_0 \xrightarrow{a_0} q_1 \xrightarrow{a_1} \cdots$, we
set $\pi(\gamma) = \pi(q_0)\pi(q_0, a_0)\pi(q_1, a_1)\cdots$. This in turn
defines a probability distribution on the set of words as follows: for a word
w, $\pi(w)$ is the sum of $\pi(\gamma)$ over all paths $\gamma$ with label w.
The probability on words obtained in this way is a transfer of a Markov chain
on the edges of the automaton.

We now give two examples of probability distributions on words. The first 
one is a distribution given by a finite automaton, the second one is more general. 


EXAMPLE 1.8.3. Consider the automaton given in Figure 1.48. Let $\pi(1) = 1$,
$\pi(2) = \pi(3) = 0$, and let $\pi(1,a,2) = \pi(1,b,3) = 1/2$,
$\pi(2,a,2) = \pi(2,b,2) = 1/2$, and $\pi(3,a,3) = 0$, $\pi(3,b,3) = 1$.


Figure 1.48. A finite automaton.


The probability $\pi$ induced on words by this distribution on the automaton is
such that $\pi(b^n) = 1/2$ for any $n \ge 1$. This distribution keeps an
unbounded memory of the past, and is therefore not a Markov distribution.
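
The distribution of Example 1.8.3 can be evaluated mechanically. The sketch below assumes the following reading of the automaton: state 1 goes to state 2 by a and to state 3 by b, each with probability 1/2; state 2 loops on a and on b with probability 1/2 each; state 3 loops on b with probability 1 and on a with probability 0. The table `EDGES` and the function name are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

# Initial distribution on the states.
INITIAL = {1: Fr(1), 2: Fr(0), 3: Fr(0)}

# (state, letter) -> (next state, probability of the edge); assumed reading.
EDGES = {
    (1, "a"): (2, Fr(1, 2)), (1, "b"): (3, Fr(1, 2)),
    (2, "a"): (2, Fr(1, 2)), (2, "b"): (2, Fr(1, 2)),
    (3, "a"): (3, Fr(0)),    (3, "b"): (3, Fr(1)),
}


def word_probability(word):
    """Sum of the probabilities of the paths labelled by word.

    The automaton is deterministic, so each initial state gives one path.
    """
    total = Fr(0)
    for start, prob in INITIAL.items():
        state = start
        for letter in word:
            state, edge_prob = EDGES[(state, letter)]
            prob *= edge_prob
        total += prob
    return total


def total_mass(n):
    """Sum of pi(w) over all words w of length n (should be 1)."""
    return sum(word_probability("".join(w)) for w in product("ab", repeat=n))
```

Under this reading, `word_probability("b" * n)` equals 1/2 for every n ≥ 1, while `word_probability("b" * n + "a")` is 0: the probability of the next letter depends on whether the whole past consists of b’s.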


EXAMPLE 1.8.4. Let t = abbabaabbaababba ⋯ be the Thue–Morse word which
is the fixed point of the morphism $\mu$ which maps a to ab and b to ba. Let S
be the set of factors of t. Define a function $\delta$ on words in S of length
at least 4 as follows. For $w \in S$ with $|w| \ge 4$, set $\delta(w) = v$,
where v is the unique word of S such that

$$\mu(v) = \begin{cases} w \text{ or } xwy & \text{if } |w| \text{ is even,} \\ wx \text{ or } xw & \text{otherwise,} \end{cases}$$

for some $x, y \in \{a, b\}$.

For $w \in S$, define $\pi(w)$ recursively by $\pi(w) = \pi(\delta(w))/2$ if
$|w| \ge 4$, and by the value given in Figure 1.49 otherwise. It is easy to
verify that $\pi$ is an invariant probability distribution on S. Indeed, one
has $\pi(\varepsilon) = 1$ and, for each $w \in S$,

$$\pi(wa) + \pi(wb) = \pi(w), \qquad \pi(aw) + \pi(bw) = \pi(w).$$


Figure 1.49. A probability distribution on the factors of the Thue–Morse
infinite word.


The notion of a probability distribution on words leads naturally to the
definition of a probability measure on the set $A^\omega$ of infinite words.
This allows one to obtain a real probability distribution, instead of the
distribution on each set $A^n$.

Let $\mathcal{C}$ be the set of thin cylinders, that is
$\mathcal{C} = \{wA^\omega \mid w \in A^*\}$, and let $\Sigma$ be the
σ-algebra generated by $\mathcal{C}$. Recall that the σ-algebra generated by
$\mathcal{C}$ is the smallest family of sets containing $\mathcal{C}$ and
closed under complements and countable unions. A function $\mu$ from a
σ-algebra $\Sigma$ to the real numbers is said to be σ-additive if

$$\mu\Bigl(\bigcup_{n \ge 0} E_n\Bigr) = \sum_{n \ge 0} \mu(E_n)$$

for any family $E_n$ of pairwise disjoint sets from $\Sigma$. A probability
measure $\mu$ on $(A^\omega, \Sigma)$ is a real valued function on $\Sigma$
such that $\mu(A^\omega) = 1$ and which is σ-additive.

By a classical theorem due to Kolmogorov, for each probability distribution
$\pi$ there exists a unique probability measure $\mu$ on $(A^\omega, \Sigma)$
such that $\mu(xA^\omega) = \pi(x)$.


EXAMPLE 1.8.5. Let us consider again the distribution $\pi$ on $A^*$ with
A = {a,b} of Example 1.8.3. The corresponding probability measure $\mu$ on
$A^\omega$ is such that $\mu(b^\omega) = 1/2$. Indeed, since
$A^\omega = \bigcup_{i \ge 0} b^i a A^\omega \cup b^\omega$ one has by the
property of σ-additivity


$$\mu(b^\omega) = \mu(A^\omega) - \sum_{i \ge 0} \mu(b^i a A^\omega) = 1 - \sum_{i \ge 0} \pi(b^i a) = 1 - \pi(a) = 1/2.$$


1.8.2. Entropy 


Let U be a finite set, and let X be a random variable with values in U. Set
$p(u) = P(X = u)$. We define the entropy of X as


$$H(X) = -\sum_{u \in U} p(u) \log p(u).$$


We use the convention that 0 log 0 = 0. We also use the convention that the
logarithm is taken in base 2. In this way, when the set U has two elements 0
and 1, with p(0) = p(1) = 1/2, then H(X) = 1. More generally, if U has n
elements, then

$$H(X) \le \log n \tag{1.8.1}$$


and the equality $H(X) = \log n$ holds if and only if $p(u) = 1/n$ for
$u \in U$.
To prove this statement, we first establish the following assertion: Let
$p_i$, $q_i$ (for $1 \le i \le n$) be two finite probability distributions with
$p_i, q_i > 0$. Then


$$\sum_i p_i \log p_i \ge \sum_i p_i \log q_i \tag{1.8.2}$$


with equality if and only if $p_i = q_i$ for $i = 1, \ldots, n$.
Indeed, observe first that $\log_e(x) \le x - 1$ for $0 < x$ with equality if
and only if x = 1. Thus for $1 \le i \le n$


$$\log_e(q_i/p_i) \le q_i/p_i - 1$$


and consequently

$$\sum_i p_i \log_e(q_i/p_i) \le \sum_i q_i - 1 = 0.$$


This shows the inequality (1.8.2) for the logarithm in base e. Multiplying by an
appropriate constant gives the general inequality. Equality holds if and only if
$p_i = q_i$ for all i.
If we choose $q_i = 1/n$ for all i, inequality (1.8.2) becomes inequality
(1.8.1).
If (X, Y) is a two-dimensional random variable with values in U × V, we set
$p(u,v) = P(X = u, Y = v)$. Thus


$$H(X,Y) = -\sum_{(u,v) \in U \times V} p(u,v) \log p(u,v).$$

Finally, set $p(u|v) = P(X = u \mid Y = v)$. Then we first define


$$H(X|v) = -\sum_{u \in U} p(u|v) \log p(u|v)$$


and for two random variables X, Y, we set


$$H(X|Y) = \sum_{v \in V} H(X|v)\, p(v).$$

It is easy to check that

$$H(X|Y) = -\sum_{(u,v) \in U \times V} p(u,v) \log p(u|v).$$

It can be checked that

$$H(X,Y) = H(Y) + H(X|Y). \tag{1.8.3}$$


Indeed

$$\begin{aligned}
H(X,Y) &= -\sum_{u,v} p(u,v) \log p(u,v) \\
&= -\sum_{u,v} p(u,v) \log\bigl(p(u|v)\,p(v)\bigr) \\
&= -\sum_{u,v} p(u,v) \log p(u|v) - \sum_{u,v} p(u,v) \log p(v) \\
&= H(X|Y) + H(Y).
\end{aligned}$$


It can also be verified that

$$H(X,Y) \le H(X) + H(Y) \tag{1.8.4}$$

and that the equality holds if and only if X and Y are independent. Indeed,

$$\begin{aligned}
H(X) + H(Y) &= -\sum_{u} p(u) \log p(u) - \sum_{v} p(v) \log p(v) \\
&= -\sum_{u,v} p(u,v) \log p(u) - \sum_{u,v} p(u,v) \log p(v) \\
&= -\sum_{u,v} p(u,v) \log\bigl(p(u)\,p(v)\bigr) \\
&\ge -\sum_{u,v} p(u,v) \log p(u,v),
\end{aligned}$$

where the last inequality follows from Inequality (1.8.2).
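
The identities (1.8.3) and (1.8.4) are easy to check numerically. The sketch below uses a small joint distribution on U × V with U = {0, 1} and V = {x, y}; the numerical values are made up for the illustration and the function names are hypothetical.

```python
from math import log2

# Hypothetical joint distribution p(u, v) on {0, 1} x {x, y}.
JOINT = {
    ("0", "x"): 0.50, ("0", "y"): 0.25,
    ("1", "x"): 0.00, ("1", "y"): 0.25,
}


def entropy(dist):
    """H = -sum p log2 p, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal distribution of the coordinate axis (0 for X, 1 for Y)."""
    result = {}
    for pair, p in joint.items():
        result[pair[axis]] = result.get(pair[axis], 0.0) + p
    return result


def conditional_entropy(joint):
    """H(X|Y) = -sum p(u, v) log2 p(u|v), where p(u|v) = p(u, v) / p(v)."""
    p_y = marginal(joint, 1)
    return -sum(p * log2(p / p_y[v]) for (u, v), p in joint.items() if p > 0)


H_X = entropy(marginal(JOINT, 0))
H_Y = entropy(marginal(JOINT, 1))
H_XY = entropy(JOINT)
H_X_given_Y = conditional_entropy(JOINT)
```

Here `H_XY` equals `H_Y + H_X_given_Y` up to rounding, and is strictly smaller than `H_X + H_Y` because X and Y are not independent in this example.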
More generally, if $(X_1, \ldots, X_n, \ldots)$ is an information source, then
$H_n = H(X_1, \ldots, X_n)$ is defined as the entropy of the random variable
$(X_1, \ldots, X_n)$. In terms of a probability distribution $\pi$, we have


$$H_n = -\sum_{x \in A^n} \pi(x) \log \pi(x).$$

Thus $H_n$ is the entropy of the finite probability space $U = A^n$.
Assume now that the source $(X_1, \ldots, X_n, \ldots)$ is stationary. The
entropy of the source is defined as


$$H = \lim_{n \to \infty} \frac{1}{n} H_n.$$

According to the context, we write interchangeably H or H(X). We show that
this limit exists. First, observe that, by Inequality (1.8.4), for all
$m, n \ge 1$


$$H(X_1, \ldots, X_{m+n}) \le H(X_1, \ldots, X_m) + H(X_{m+1}, \ldots, X_{m+n}).$$

Since the source is stationary,
$H(X_{m+1}, \ldots, X_{m+n}) = H(X_1, \ldots, X_n)$. This implies

$$H_{m+n} \le H_m + H_n.$$

Thus the sequence $(H_n)$ is a subadditive sequence of positive numbers. This
implies that $H_n/n$ has a limit.


We now give expressions for the entropy of particular sources. The entropy
of a Bernoulli distribution $\pi$ is


$$H = -\sum_{a \in A} \pi(a) \log \pi(a).$$


Indeed, $H_n = nH$ since the random variables $X_1, \ldots, X_n$ are
independent and identically distributed. As a particular case, if
$\pi(a) = 1/q$ for all $a \in A$, then $H = \log q$.
The entropy of an irreducible Markov chain with matrix P and stationary
distribution $\pi$ is

$$H = \sum_{a \in A} \pi(a) H^a$$

where $H^a = -\sum_{b \in A} P(a,b) \log P(a,b)$. Indeed, by Formula (1.8.3),
one has

$$H(X_1, \ldots, X_n) = H(X_1) + \sum_{k=1}^{n-1} H(X_{k+1} \mid X_k).$$


By definition,

$$H(X_{n+1} \mid X_n) = \sum_{a \in A} H(X_{n+1} \mid X_n = a)\, P(X_n = a)$$


and $H(X_{n+1} \mid X_n = a) = H^a$. Since $P(X_n = a)$ tends to $\pi(a)$,
$H(X_{n+1} \mid X_n)$ tends to $\sum_{a \in A} H^a \pi(a)$. This implies


$$\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n) = \sum_{a \in A} H^a \pi(a).$$


EXAMPLE 1.8.6. Consider again the Markov chain given by


$$P = \begin{pmatrix} 1/2 & 1/2 \\ 1 & 0 \end{pmatrix}$$


and the initial distribution $\pi(a) = 2/3$, and $\pi(b) = 1/3$. Then
$H^a = 1$, $H^b = 0$ and $H = 2/3$.
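
The values of Example 1.8.6 can be recomputed directly, together with the block entropies $H_n$ obtained by enumerating all words of length n. This is a sketch with hypothetical function names; since the chain is started from its stationary distribution, one expects $H_n = H_1 + (n-1)H$.

```python
from itertools import product
from math import log2

P = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 1.0, "b": 0.0}}
PI = {"a": 2 / 3, "b": 1 / 3}          # stationary distribution


def row_entropy(a):
    """H^a = -sum_b P(a, b) log2 P(a, b), with 0 log 0 = 0."""
    return -sum(p * log2(p) for p in P[a].values() if p > 0)


def chain_entropy():
    """H = sum_a pi(a) H^a."""
    return sum(PI[a] * row_entropy(a) for a in PI)


def block_entropy(n):
    """H_n = -sum of pi(x) log2 pi(x) over the words x of length n >= 1."""
    total = 0.0
    for w in product("ab", repeat=n):
        p = PI[w[0]]
        for x, y in zip(w, w[1:]):
            p *= P[x][y]
        if p > 0:
            total -= p * log2(p)
    return total


H = chain_entropy()
```

The ratio `block_entropy(n) / n` decreases slowly towards `H` as n grows, as predicted by the subadditivity of the sequence $(H_n)$.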
1.8.3. Topological entropy


For a set S of words, one defines the topological entropy of S as the limit

$$h(S) = \limsup_{n \to \infty} \frac{1}{n} \log s_n,$$

where $s_n$ is the number of words of length n in S. This entropy is called the
topological entropy to distinguish it from the entropy defined above.
Let $\pi$ be a stationary distribution. Let S be the set of words $x \in A^*$
such that $\pi(x) > 0$. Then

$$H(\pi) \le h(S).$$


Indeed, in view of Inequality (1.8.1), one has for each $n \ge 1$,

$$H_n \le \log s_n$$


where $s_n$ is the number of words of length n in S. The inequality follows
by taking the limit. Thus the topological entropy of S is an upper bound to
the value of possible entropies related to a stationary probability distribution
supported by S.

In the case of a regular set S, the entropy h(S) can be easily computed using
the Perron–Frobenius theorem. Indeed, let $\mathcal{A}$ be a deterministic
automaton recognizing S, and let M be the adjacency matrix of the underlying
graph. By the Perron–Frobenius theorem, there is a real positive eigenvalue
$\lambda$ which is the maximum of the moduli of all eigenvalues. One has the
formula


$$h(S) = \log \lambda.$$


This formula expresses the fact that the number $s_n$ of words of length n in S
grows as $\lambda^n$.


EXAMPLE 1.8.7. Consider again the golden mean automaton of Example 1.3.5
which we redraw for convenience. It recognizes the set S of words without two
consecutive a’s. We have

$$M = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$$

and $\lambda = (1 + \sqrt{5})/2$. Thus $h(S) = \log\bigl((1 + \sqrt{5})/2\bigr)$.


Figure 1.50. The golden mean automaton.
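
The growth rate stated in Example 1.8.7 can be observed numerically by counting paths in the automaton. The sketch below assumes the usual golden mean automaton: state 1 has a loop labelled b and an edge labelled a to state 2, and state 2 has an edge labelled b back to state 1; the function name is hypothetical.

```python
from math import log2, sqrt


def count_words(n):
    """Number s_n of words of length n without aa, counted as paths of
    length n starting in state 1 of the assumed golden mean automaton."""
    in_state_1, in_state_2 = 1, 0
    for _ in range(n):
        # state 1 is entered by b from both states, state 2 by a from state 1
        in_state_1, in_state_2 = in_state_1 + in_state_2, in_state_1
    return in_state_1 + in_state_2


golden = (1 + sqrt(5)) / 2
estimate = log2(count_words(200)) / 200   # (1/n) log s_n for n = 200
```

The value `estimate` is close to `log2(golden)`, about 0.694 bits per symbol; the remaining gap is of order 1/n.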
1.8.4. Distribution of maximal entropy


As we have seen in the previous section, the topological entropy of a set S is an
upper bound to the value of possible entropies related to a stationary probability
distribution supported by S. A probability distribution $\pi$ supported by S
such that $H(\pi) = h(S)$ is called a distribution of maximal entropy.
Intuitively, a distribution of maximal entropy on a set of words S is such that
all words of S of given length have approximately the same probability.

We are going to show that for each rational set S, there exists a distribution
$\pi$ supported by S of maximal entropy, i.e. such that $H(\pi) = h(S)$.

We consider a deterministic automaton $\mathcal{A}$ recognizing S. Let G be the
underlying graph of $\mathcal{A}$ with labels removed. We assume that G is strongly
connected. Let Q be the set of vertices of G and let M be its adjacency matrix.
The graph G is strongly connected if and only if the matrix M is irreducible.

By the Perron-Frobenius theorem, an irreducible matrix M has a real positive
simple eigenvalue $\lambda$ larger than or equal to the modulus of any other
eigenvalue.

The number of paths of length n in G grows as $\lambda^n$, up to a bounded factor.
We prove that there is a labelling of the edges of G by positive real numbers which results in
a Markov chain on Q of entropy $\log \lambda$. This produces a probability distribution
supported exactly by the words of S, with maximal entropy $\log \lambda$.

Again by the Perron-Frobenius theorem, there exist a right eigenvector v
and a left eigenvector w for the eigenvalue $\lambda$ such that all $v_i$ and $w_i$ are strictly
positive. We normalize v and w so that $v \cdot w = 1$. Let


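$$p_{i,j} = \frac{v_j}{\lambda v_i}\, m_{i,j}$$
and let $\pi_i = v_i w_i$. The matrix P is stochastic since
$$\sum_j p_{i,j} = \sum_j \frac{v_j}{\lambda v_i}\, m_{i,j} = \frac{1}{\lambda v_i} \sum_j m_{i,j} v_j = 1.$$

The Markov chain with transition matrix P and initial distribution $\pi$ is stationary. Indeed,

$$\sum_i \pi_i p_{i,j} = \sum_i v_i w_i \frac{v_j}{\lambda v_i}\, m_{i,j} = \frac{v_j}{\lambda} \sum_i w_i m_{i,j} = \frac{v_j}{\lambda}\, \lambda w_j = \pi_j.$$

The entropy of the Markov chain is $\log \lambda$. Indeed, the probability of any path
$\gamma$ of length n from i to j is
$$p(\gamma) = \frac{w_i v_j}{\lambda^n}.$$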

This proves the existence of a distribution with maximal entropy on S when 
the graph of the automaton is strongly connected. In this case, the uniqueness 
can also be proved. The existence in the general case can be shown to reduce 
to this one. 
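
The construction of P and of the initial distribution from the two Perron eigenvectors can be sketched as follows. This is an illustrative sketch only: the names `perron_pair` and `maximal_entropy_chain` are hypothetical, and the eigenvectors are approximated by power iteration, which assumes that the adjacency matrix is primitive (a stronger hypothesis than irreducibility).

```python
def perron_pair(matrix, iterations=500):
    """Dominant eigenvalue and a positive right eigenvector of a primitive
    nonnegative matrix, approximated by power iteration."""
    size = len(matrix)
    vector = [1.0] * size
    eigenvalue = 1.0
    for _ in range(iterations):
        image = [sum(matrix[i][j] * vector[j] for j in range(size))
                 for i in range(size)]
        eigenvalue = max(image)
        vector = [x / eigenvalue for x in image]
    return eigenvalue, vector

def maximal_entropy_chain(matrix):
    """Return (lambda, P, pi) with p[i][j] = v[j] m[i][j] / (lambda v[i])
    and pi[i] = v[i] w[i], where v.w = 1."""
    size = len(matrix)
    lam, v = perron_pair(matrix)
    # a left eigenvector of M is a right eigenvector of its transpose
    transpose = [[matrix[j][i] for j in range(size)] for i in range(size)]
    _, w = perron_pair(transpose)
    scale = sum(v[i] * w[i] for i in range(size))
    w = [x / scale for x in w]               # normalization v.w = 1
    P = [[v[j] * matrix[i][j] / (lam * v[i]) for j in range(size)]
         for i in range(size)]
    pi = [v[i] * w[i] for i in range(size)]
    return lam, P, pi

# Adjacency matrix of the golden mean automaton
lam, P, pi = maximal_entropy_chain([[1, 1], [1, 0]])
```

Only the product $v_i w_i$ and the ratios $v_j / v_i$ enter the result, so the way the normalization is shared between v and w does not matter.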


Figure 1.51. The golden mean automaton with transition probabilities. 


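EXAMPLE 1.8.8. Let us consider again the golden mean automaton of Example 1.3.5,
which recognizes the set S of words without aa. With $\lambda = (1+\sqrt{5})/2$, we have

$$M = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}, \quad w = \begin{bmatrix} \lambda & 1 \end{bmatrix}, \quad v = \frac{1}{1+\lambda^2} \begin{bmatrix} \lambda \\ 1 \end{bmatrix},$$

$$P = \begin{bmatrix} 1/\lambda & 1/\lambda^2 \\ 1 & 0 \end{bmatrix}, \quad \pi = \begin{bmatrix} \dfrac{\lambda^2}{1+\lambda^2} & \dfrac{1}{1+\lambda^2} \end{bmatrix}.$$

The values of the transition probabilities are represented on the automaton
of Figure 1.51. The probability distribution on words induced by this Markov
chain is pictured in Figure 1.52.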


Figure 1.52. The tree of the golden mean. 


As a consequence of the above construction, the distribution of maximal
entropy associated with a rational set S is given by a finite automaton. It is
even more remarkable that it can be given by the same automaton as the
set S itself. This appears clearly in the above example, where the automaton of
Figure 1.51 is the same as the golden mean automaton of Figure 1.11.


1.8.5. Ergodic sources and compressions 


Consider a source $X = (X_1, X_2, \ldots, X_n, \ldots)$ on the alphabet A associated with
a probability distribution $\pi$. Given a word $w = a_1 \cdots a_n$ on A, denote by
$f_N(w)$ the frequency of occurrences of the word w in the first N terms of the
sequence X.

We say that the source X is ergodic if for any word w, the sequence $f_N(w)$
tends almost surely to $\pi(w)$. An ergodic source is stationary. The converse is
not true, as shown by the following example.


EXAMPLE 1.8.9. Let us consider again the distribution of Example 1.8.3. This
distribution is stationary. We have $f_N(b) = 1$ when the source outputs only b's,
although the probability of b is 1/2. Thus, this source is not ergodic.


EXAMPLE 1.8.10. Consider the distribution of Example 1.8.4. This source is
ergodic. Indeed, the definition of $\pi$ implies that the frequency $f_N(w)$ of any
factor w in the Thue-Morse word tends to $\pi(w)$.


It can be proved that any Bernoulli source is ergodic. This implies in particular
the statement known as the strong law of large numbers: if the sequence $X =
(X_1, X_2, \ldots, X_n, \ldots)$ is independent and identically distributed then, setting
$S_n = X_1 + \cdots + X_n$, the sequence $\frac{1}{n} S_n$ converges almost surely to the common
value $E(X_i)$.

More generally, any irreducible Markov chain equipped with its stationary 
distribution as initial distribution is an ergodic source. 

Ergodic sources have the important property that typical messages of the
same length have approximately the same probability, which is $2^{-nH}$ where H
is the entropy of the source. Let us give a more precise formulation of this
property, known as the asymptotic equirepartition property. Let $(X_1, X_2, \ldots)$ be
an ergodic source with entropy H. Then for any $\epsilon > 0$ there is an N such that
for all $n \ge N$, the set of words of length n is the union of two sets R and T
satisfying

(i) $\pi(R) < \epsilon$
(ii) for each $w \in T$,
$$2^{-n(H+\epsilon)} < \pi(w) < 2^{-n(H-\epsilon)}$$

where $\pi$ denotes the probability distribution on $A^n$ defined by $\pi(a_1 a_2 \cdots a_n) =
P(X_1 = a_1, \ldots, X_n = a_n)$. Thus, the set of messages of length n is partitioned
into a set R of negligible probability and a set T of "typical" messages, all having
probability approximately $2^{-nH}$.

Since $\pi(w) > 2^{-n(H+\epsilon)}$ for $w \in T$, the number of typical messages satisfies
$\mathrm{Card}(T) < 2^{n(H+\epsilon)}$. This observation allows us to see that the entropy gives a
lower bound for the compression of a text. Indeed, if the messages of length n are
coded unambiguously by binary messages of average length $\ell$, then $\ell/n \ge H - \epsilon$
since otherwise two different messages would have the same coding. On the
other hand, any coding assigning different binary words of length $n(H + \epsilon)$
to the typical messages and arbitrary values to the other messages will give a
coding of compression rate approximately equal to H.

It is interesting in practice to have compression methods which are universal 
in the sense that they do not depend on a particular source. Some of these 
methods however achieve asymptotically the theoretical lower bound given by 
the entropy for all ergodic sources. We sketch here the presentation of one of
these methods among many, the Ziv-Lempel encoding algorithm. This algorithm 
fits well in our selection of topics because it is combinatorial in nature. 

We consider for a word w the factorization

$$w = x_1 x_2 \cdots x_m u$$

where
1. for each $i = 1, \ldots, m$, the word $x_i$ is chosen as short as possible among the
words not in the set $\{x_0, x_1, x_2, \ldots, x_{i-1}\}$, with the convention $x_0 = \varepsilon$.
2. the word u is a prefix of some $x_i$.
This factorization is called the Ziv-Lempel factorization of w. It appears again
in Chapter 8. For example, the Fibonacci word has the factorization

(a)(b)(aa)(ba)(baa)(baab)(ab)(aab)(aba) ···

The coding of the word w is the sequence $(n_1, a_1), (n_2, a_2), \ldots, (n_m, a_m)$ where
$n_1 = 0$ and $x_1 = a_1$, and for each $i = 2, \ldots, m$, we have $x_i = x_{n_i} a_i$, with
$n_i < i$ and $a_i$ a letter. Writing each integer $n_i$ in binary gives a coding of length
approximately $m \log m$ bits. It can be shown that for any ergodic source, the
quantity $m \log m / n$, where n is the length of w, tends almost surely to the
entropy of the source. Thus this coding is an optimal universal coding.

In practice, the coding of a word w uses a set D called the dictionary to
maintain the set of words $\{x_1, \ldots, x_i\}$. We use a trie (see Section 1.3.1) to
represent the set D. We also suppose that the word ends with a final symbol to
avoid coding the last factor u.


ZLENCODING(w)
    ▷ returns the Ziv-Lempel encoding c of w
    T ← NEWTRIE()
    (c, i) ← (ε, 0)
    while i < |w| do
        (ℓ, p) ← LONGESTPREFIXINTRIE(w, i)
        a ← w[i + ℓ]
        q ← NEWVERTEX()
        NEXT(p, a) ← q              ▷ updates the trie T
        c ← c · (p, a)              ▷ appends (p, a) to c
        i ← i + ℓ + 1
    return c
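
The encoding can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the text: the name `zl_encode` is hypothetical, the trie is stored as a plain dictionary mapping (vertex, letter) to vertex, vertex 0 is the root, and instead of a final symbol the last letter of the word is always emitted as the letter of the last pair.

```python
def zl_encode(word):
    """Return the Ziv-Lempel encoding of word as a list of pairs (p, a),
    where p is the index of an earlier factor (0 for the empty word)."""
    next_vertex = {}          # trie edges: (vertex, letter) -> vertex
    vertex_count = 1          # vertex 0 is the root, i.e. the empty word
    code = []
    i = 0
    while i < len(word):
        p = 0
        # follow the longest prefix of word[i:] already in the trie,
        # always keeping at least one letter for the pair
        while i < len(word) - 1 and (p, word[i]) in next_vertex:
            p = next_vertex[(p, word[i])]
            i += 1
        a = word[i]
        next_vertex[(p, a)] = vertex_count   # the new factor gets the next index
        vertex_count += 1
        code.append((p, a))
        i += 1
    return code

fibonacci_prefix = "abaababaabaababaababa"
pairs = zl_encode(fibonacci_prefix)
```

With hashed edges, each letter of the word is examined once, so the running time is linear on average in the length of the word.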


The result is a linear time algorithm. The decoding is also simple. The 
important point is that there is no need to transmit the dictionary. Indeed, one 
builds it in the same way as it was built in the encoding phase. It is convenient 
this time to represent the dictionary as an array of strings. 
ZLDECODING(c)
    (w, i) ← (ε, 0)
    D[i] ← ε
    while c ≠ ε do
        (p, a) ← CURRENT()          ▷ returns the current pair in c
        ADVANCE()
        y ← D[p]
        i ← i + 1
        D[i] ← ya                   ▷ adds ya to the dictionary
        w ← wya
    return w
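
A matching decoder can be sketched as below, again as an illustration only: the name `zl_decode` is hypothetical, and the sequence c is assumed to be given as a Python list of pairs (p, a), so that no token management is needed.

```python
def zl_decode(code):
    """Rebuild the word from a list of pairs (p, a): the i-th factor is
    the p-th factor followed by the letter a."""
    dictionary = [""]         # dictionary[0] is the empty word
    pieces = []
    for p, a in code:
        factor = dictionary[p] + a
        dictionary.append(factor)    # same dictionary as in the encoder
        pieces.append(factor)
    return "".join(pieces)

pairs = [(0, "a"), (0, "b"), (1, "a"), (2, "a"), (4, "a"),
         (5, "b"), (1, "b"), (3, "b"), (7, "a")]
decoded = zl_decode(pairs)
```

The pairs used here encode the factorization (a)(b)(aa)(ba)(baa)(baab)(ab)(aab)(aba) of a prefix of the Fibonacci word.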


The functions CURRENT() and ADVANCE() manage the sequence c, consid- 
ering each pair as a token. The practical details of the implementation are 
delicate. In particular, it is advised not to let the size of the dictionary grow 
too much. One strategy consists in limiting the size of the input, encoding it by 
blocks. Another one is to reset the dictionary once it has exceeded some pre- 
scribed size. In either case, the decoding algorithm must of course also follow 
the same strategy. 


1.8.6. Unique ergodicity 


We have seen that in some cases, given a formal language S, there exists a unique
invariant measure with entropy equal to the topological entropy of the set S. In
particular, this is true in the case of a regular set S recognized by an automaton
with a strongly connected graph. In this case, the measure is also ergodic since
it is the invariant measure corresponding to an irreducible Markov chain. There
are even cases in which there is a unique invariant measure supported by S.
This is the so-called property of unique ergodicity. We will see below that this
situation arises for the factors of fixed points of primitive morphisms.

Example 1.8.4 is one illustration of this case. We got the result by an
elementary computation. In the general case, one considers a morphism $f :
A^* \to A^*$ that admits a fixed point $u \in A^\omega$. Let M be the A × A-matrix
defined by

$$M_{a,b} = |f(a)|_b$$

where $|x|_a$ is the number of occurrences of the symbol a in the word x. We
suppose the morphism f to be primitive, which by definition means that the
matrix M itself is primitive. It is easy to verify that for any n, the entry $M^n_{a,b}$
is the number of occurrences of b in the word $f^n(a)$.

Since the matrix M associated to the morphism f is primitive it is also
irreducible. By the Perron-Frobenius theorem, there is a unique real positive
eigenvalue $\lambda$ of maximal modulus and a real positive eigenvector v such that
$vM = \lambda v$. We normalize v by $\sum_{a \in A} v_a = 1$.

Using the fact that M is primitive, again by the Perron-Frobenius theorem,
$\frac{1}{\lambda^n} M^n$ tends to a matrix with rows proportional to v when n tends to $\infty$. This
shows that the frequency of a symbol b in u is equal to $v_b$.
The value of the distribution of maximal entropy on the letters is given by
$\pi(a) = v_a$. For words of length $\ell$ larger than 1, a similar computation can be
carried out, provided one passes to the alphabet of overlapping words of length
$\ell$, as shown in the following example.


EXAMPLE 1.8.11. Let us consider again the set S of factors of the Thue-Morse
infinite word t (Example 1.8.4). The matrix of the morphism $\mu : a \to ab, b \to ba$ is

$$M = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}.$$

The left eigenvector is v = [1/2 1/2] and the maximal eigenvalue is 2. Accordingly,
the probabilities of the symbols are $\pi(a) = \pi(b) = 1/2$. To compute by this
method the probability of the words of length 2, we replace the alphabet A by
the alphabet $A_2 = \{x, y, z, t\}$ with x = aa, y = ab, z = ba and t = bb. We
replace $\mu$ by the morphism $\mu_2$ obtained by coding successively the overlapping
blocks of length 2 appearing in $\mu(A_2)$.

It is enough to truncate at length 2 in order to get a morphism that has as
unique fixed point the infinite word $t_2$ obtained by coding overlapping blocks of
length 2 in t. Thus

$$\mu_2 : \quad x \to yz, \quad y \to yt, \quad z \to zx, \quad t \to zy$$

has the fixed point

$$t_2 = ytzyzxytzxyz \cdots.$$

The matrix associated with $\mu_2$ is

$$M^{(2)} = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix}.$$

The left eigenvector is $v_2 = [1/6 \ 1/3 \ 1/3 \ 1/6]$, consistently with the values of $\pi$
given in Figure 1.49.
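
The passage to overlapping blocks and the computation of the left eigenvector can be reproduced with a short program. The sketch below is illustrative: the names `block_morphism` and `left_perron_vector` are hypothetical, the image of a block is truncated to as many blocks as the image of its first letter has letters, and the eigenvector is approximated by power iteration, which assumes a primitive matrix.

```python
def block_morphism(morphism, blocks):
    """Images of the blocks under the morphism induced on overlapping blocks."""
    images = {}
    for block in blocks:
        image = "".join(morphism[letter] for letter in block)
        width = len(block)
        count = len(morphism[block[0]])      # truncation length
        images[block] = [image[i:i + width] for i in range(count)]
    return images

def left_perron_vector(matrix, iterations=200):
    """Left eigenvector of the dominant eigenvalue, normalized to sum 1."""
    size = len(matrix)
    vector = [1.0 / size] * size
    for _ in range(iterations):
        image = [sum(vector[i] * matrix[i][j] for i in range(size))
                 for j in range(size)]
        total = sum(image)
        vector = [x / total for x in image]
    return vector

thue_morse = {"a": "ab", "b": "ba"}
blocks = ["aa", "ab", "ba", "bb"]            # x, y, z, t in the text
images = block_morphism(thue_morse, blocks)
matrix = [[images[x].count(y) for y in blocks] for x in blocks]
frequencies = left_perron_vector(matrix)
```

The list `frequencies` approximates the frequencies of the factors aa, ab, ba, bb in the Thue-Morse word.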


1.8.7. Practical estimate of the entropy 


The entropy of a source given by an experiment and not by an abstract model 
(like a Markov chain for example) can usefully be estimated. This occurs in 
practice in the context of natural languages or for sources producing signals 
recorded by some physical measure. 

The case of natural languages is of practical interest for the purpose of text
compression. An estimate of the entropy H of a natural language like English
implies for example that an optimal compression algorithm can encode using
H bits per character on average. The definition of a quantity which can be
called 'entropy of English' deserves some commentary. First we have to clarify
the nature of the sequences considered. A reasonable simplification is to assume
that the alphabet is composed of the 26 ordinary letters (and thus without the
upper/lower case distinction) plus possibly a blank character to separate words.
The second convention is of different nature. If one wants to consider a natural 
language as an information source, an assumption has to be made about the 
nature of the source. The good approximation obtained by finite automata 
for the description of natural languages makes it reasonable to assume that a 
natural language like English can be considered as an irreducible Markov chain 
and thus as an ergodic source. Thus it makes sense to estimate the probabilities
by the frequencies observed on a text or a corpus of texts and to use these
approximations to estimate the entropy H by $H \approx H_n/n$ where

$$H_n = -\sum_k p_k \log p_k$$

and where the $p_k$ are the probabilities of the n-grams. One has actually $H \le
H_n/n$. It is of interest to remark that the approximation thus obtained is much
better than the one obtained by using $H \approx h_n/n$ with

$$h_n = \log s_n$$

where $s_n$ is the number of possible n-grams in correct English sentences. For
small n the approximation is bad because some n-grams are far more frequent
than others, and for large n the computation is not feasible because the number
of correct sentences is too large.

One has $H \le \log_2 26 \approx 4.7$ when considering only 26 symbols and $H \le
\log_2 27 \approx 4.76$ on 27 symbols. Further values are given in Table 1.2,
leading to an upper bound of roughly 3 bits per symbol.

number of symbols |  26  |  27
H_1               | 4.14 | 4.03
H_2/2             | 3.56 | 3.32
H_3/3             | 3.30 | 3.10

Table 1.2. Entropies of n-grams on an alphabet of 26 or 27 letters

An algorithm to compute the frequencies of n-grams is easy to implement. It
uses a buffer s which is initialized to the initial n symbols of the text and
which is updated by shifting the symbols one place to the left and adding the
current symbol of the text at the last place.
This is done by the function CURRENT(). The algorithm maintains a set S of
n-grams together with a map FREQ() containing the frequencies of each n-gram.
A practical implementation should use a representation of sets like a hash table,
which stores the set in a space proportional to the size of S (and not to the
number of all possible n-grams, which grows too fast).
ENTROPY(n)
    ▷ returns the estimate $H_n/n$ obtained from the n-th order entropy $H_n$
    S ← ∅                           ▷ S is the set of n-grams in the text
    N ← 0                           ▷ N is the number of n-grams read
    do  s ← CURRENT()               ▷ s is the current n-gram of the text
        N ← N + 1
        if s ∉ S then
            S ← S ∪ {s}
            FREQ(s) ← 1
        else FREQ(s) ← FREQ(s) + 1
    while there are more symbols
    for s ∈ S do
        PROB(s) ← FREQ(s)/N
    return $-\frac{1}{n} \sum_{s \in S}$ PROB(s) log PROB(s)
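
The same computation can be sketched in Python with a hash-based counter. This is a minimal sketch under assumptions that are not in the text: the name `ngram_entropy_rate` is hypothetical, the text is given as a string, logarithms are in base 2, and each frequency is divided by the total number of n-grams read so that the estimated probabilities sum to 1.

```python
import math
from collections import Counter

def ngram_entropy_rate(text, n):
    """Estimate H_n / n from the n-gram frequencies observed in text."""
    # sliding window of width n over the text
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    h_n = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h_n / n

rate_1 = ngram_entropy_rate("abababab", 1)
rate_2 = ngram_entropy_rate("abababab", 2)
```

On the periodic text used here, the estimate drops when n goes from 1 to 2, since the bigrams ab and ba are the only ones that occur.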


Another approach leads to a better estimate of H. It is based on an experiment
which uses a human being as an oracle. The idea is to scan a text through
a window of n − 1 consecutive characters and to ask a subject to guess the symbol
following the window contents, repeating the question until the answer is
correct. The average number of probes is an estimate of the conditional entropy
$H(X_n \mid X_1, \ldots, X_{n-1})$. The values obtained are shown in Table 1.3.


n           |  1  |  2  |  3  |  4  |  5  |  6  |  7
upper bound | 4.0 | 3.4 | 3.0 | 2.6 | 2.1 | 1.9 | 1.3
lower bound | 3.2 | 2.5 | 2.1 | 1.8 | 1.2 | 1.1 | 0.6

Table 1.3. Experimental bounds for the entropy of English


1.9. Statistics on words 


In this section, we consider the problem of computing the probability of appear- 
ance of some properties on words defined using the concepts introduced at the 
beginning of the chapter. In particular, we shall study the average number of 
factors or subwords of a given type in a regular set. 


1.9.1. Occurrences of factors 


For any integer valued random variable X with probability distribution $p_n =
P(X = n)$, one introduces the generating series $f(z) = \sum_{n \ge 0} p_n z^n$. If we denote
$q_n = \sum_{m > n} p_m$, then the generating series $g(z) = \sum_{n \ge 0} q_n z^n$ is given by the
formula
$$g(z) = \frac{1 - f(z)}{1 - z}.$$

This implies in particular that the expectation $E(X) = \sum_{n \ge 0} n p_n$ of X has
also the expression $E(X) = g(1)$. These general observations about random
variables have an important interpretation when the random variable X is the
length of a prefix in a given prefix code.
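
For the reader's convenience, here is one way to verify the relation between the two generating series. It is a short derivation added as a sketch, using only the definitions of $p_n$ and $q_n$ above, together with $q_0 = 1 - p_0$ and $q_{n-1} - q_n = p_n$.

```latex
\begin{align*}
(1 - z)\, g(z)
  &= \sum_{n \ge 0} q_n z^n - \sum_{n \ge 1} q_{n-1} z^n \\
  &= q_0 - \sum_{n \ge 1} (q_{n-1} - q_n)\, z^n \\
  &= (1 - p_0) - \sum_{n \ge 1} p_n z^n \\
  &= 1 - f(z), \\[4pt]
g(1) &= \sum_{n \ge 0} q_n
      = \sum_{n \ge 0} \sum_{m > n} p_m
      = \sum_{m \ge 1} m\, p_m
      = E(X).
\end{align*}
```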

Let $\pi$ be a probability distribution on $A^*$. For a prefix code $C \subset A^*$, the
value $\pi(C) = \sum_{x \in C} \pi(x)$ can be interpreted as the probability that a long enough
word has a prefix in C. Accordingly, we have $\pi(C) \le 1$.

Let C be a prefix code such that $\pi(C) = 1$. The average length of the words
of C is
$$\lambda(C) = \sum_{x \in C} |x|\, \pi(x).$$

One has the useful identity
$$\lambda(C) = \pi(P)$$

where $P = A^* - CA^*$ is the set of words which do not have a prefix in C. Indeed,
let $p_n = \pi(C \cap A^n)$ and $q_n = \sum_{m > n} p_m$. Then $\lambda(C) = \sum_{n \ge 1} n p_n = \sum_{n \ge 0} q_n$.
Since $\pi(P \cap A^n) = q_n$, this proves the claim.

The generating series $C(z) = \sum_{n \ge 0} p_n z^n$ is related to $P(z) = \sum_{n \ge 0} q_n z^n$
by
$$1 - C(z) = P(z)(1 - z).$$

When $\pi$ is a Bernoulli distribution, one may use unambiguous expressions
on sets to compute the probability of events definable in this way. Indeed, the
unambiguous operations translate to operations on probability generating series.
If W is a set of words, we set

$$W(z) = \sum_{n \ge 0} \pi(W \cap A^n)\, z^n.$$

Then, if U + V, UV and $U^*$ are unambiguous expressions, we have

$$(U + V)(z) = U(z) + V(z), \quad (UV)(z) = U(z)V(z), \quad (U^*)(z) = \frac{1}{1 - U(z)}.$$

We give below two examples of this method.
Consider first the problem of finding the expected waiting time T(w) before
seeing a word w. We are going to show that it is given by the formula

$$T(w) = \frac{\pi(Q)}{\pi(w)} \qquad (1.9.1)$$

where $Q = \{q \in A^* \mid wq \in A^*w \text{ and } |q| < |w|\}$. Thus Q is the set of (possibly
empty) words q such that w = sq with s a nonempty suffix of w.

Let indeed C be the prefix code formed of words that end with w for the first
time. Let V be the set of proper prefixes of words of C, which is also the set of
words which do not contain w as a factor. We can write

$$Vw = CQ. \qquad (1.9.2)$$

Moreover both sides of this equality are unambiguous. Thus, since $\pi(C) = 1$,
$\pi(V)\pi(w) = \pi(Q)$. Since $T(w) = \lambda(C) = \pi(V)$, this gives Formula (1.9.1).
Formula (1.9.2) can also be used
to obtain an explicit expression for the generating series C(z). Indeed, using
(1.9.2), one obtains $V(z)\pi(w)z^m = C(z)Q(z)$, where m is the length of w.
Replacing V(z) by $(1 - C(z))/(1 - z)$, one obtains

$$C(z) = \frac{\pi(w) z^m}{\pi(w) z^m + Q(z)(1 - z)}. \qquad (1.9.3)$$

The polynomial Q(z) is called the autocorrelation polynomial of w. Its explicit
expression is

$$Q(z) = 1 + \sum_{p \in P(w)} \pi(w_{n-p} \cdots w_{n-1})\, z^p$$

where P(w) is the set of periods of the word $w = w_0 \cdots w_{n-1}$, and $w_i$ denotes
the i-th letter of w. A slightly more general definition is given in Chapter 6.


EXAMPLE 1.9.1. In the particular case of $w = a^m$ and A = {a, b} with $\pi(a) =
p$, $\pi(b) = q = 1 - p$, the autocorrelation polynomial of w is

$$Q(z) = 1 + pz + \cdots + p^{m-1} z^{m-1} = \frac{1 - p^m z^m}{1 - pz}.$$

Consequently, $\pi(Q) = (1 - p^m)/q$ and formulas (1.9.1) and (1.9.3) become

$$T(w) = \frac{1 - p^m}{q\, p^m}, \qquad C(z) = \frac{(1 - pz)\, p^m z^m}{1 - z + q\, p^m z^{m+1}},$$

so that for p = q = 1/2,

$$T(w) = 2^{m+1} - 2, \qquad C(z) = \frac{(1 - z/2)\, z^m / 2^m}{1 - z + z^{m+1}/2^{m+1}}.$$

Formula (1.9.1) can be considered as a paradox. Indeed, it asserts that with
$\pi(a) = \pi(b) = 1/2$, the waiting time for the word w = aa is 6 while it is 4 for
w = ab.
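
Formula (1.9.1) is easy to evaluate exactly. The sketch below is an illustration with hypothetical names (`autocorrelation_set`, `waiting_time`); it assumes a Bernoulli distribution given as a dictionary from letters to rational probabilities.

```python
from fractions import Fraction

def autocorrelation_set(w):
    """Words q with |q| < |w| such that wq ends with w."""
    n = len(w)
    # q has length k exactly when w[k:] is a prefix of w, and then q = w[n-k:]
    return [w[n - k:] for k in range(n) if w[:n - k] == w[k:]]

def waiting_time(w, prob):
    """Expected waiting time for w, computed as pi(Q) / pi(w)."""
    def pi(word):
        result = Fraction(1)
        for letter in word:
            result *= prob[letter]
        return result
    return sum(pi(q) for q in autocorrelation_set(w)) / pi(w)

uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
time_aa = waiting_time("aa", uniform)
time_ab = waiting_time("ab", uniform)
```

The two values computed at the end are the waiting times for aa and ab under the uniform distribution, which the text compares.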

Formula (1.9.1) is related to the automaton recognizing the words ending
with w and consequently to Algorithm SEARCHFACTOR. We illustrate
this on an example. Let w = abaab. The minimal automaton recognizing the
words on {a, b} ending with w for the first time is represented in Figure 1.53.
The transitions of the automaton can actually be computed using the array b
introduced in algorithm BORDER.

i:     0   1   2   3   4   5
b[i]: -1   0   0   1   1   2

For example, the transition from state 3 by letter b is to state 2 because b[3] = 1
and w[1] = b. The set Q can also be read on the array b. Actually, we have
Q = {ε, aab} since the border of w has length 2 (b[5] = 2) and b[2] = 0.

Figure 1.53. The minimal automaton recognizing the words ending with
abaab.


As a second example, we now consider the problem of finding the probability
$f_n$ that the number of a equals the number of b for the first time in a word of
length n on {a, b} starting with a, with $\pi(a) = p$, $\pi(b) = q = 1 - p$. This is the
classical problem of return to 0 in a random walk on the line.

The set of words starting with a and having as many a as b for the first time
is the Dyck set D already studied in Section 1.6. We have already seen that
$D = aD^*b$. Thus, the generating series $D(z) = \sum_{n \ge 0} f_{2n} z^{2n}$ satisfies

$$D^2 - D + pqz^2 = 0.$$

Since D(0) = 0, it follows that

$$D(z) = \frac{1 - \sqrt{1 - 4pqz^2}}{2}.$$

This formula shows in particular that for p = q, $\pi(D) = 1/2$ since $\pi(D) = D(1)$.
But for $p \ne q$, $\pi(D) < 1/2$. An elementary application of the binomial formula
gives the coefficient $f_n$ of $D(z) = \sum_{n \ge 0} f_n z^n$:

$$f_{2n} = \frac{1}{n} \binom{2n-2}{n-1} p^n q^n.$$

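
The coefficients can be computed exactly to observe how their sum approaches D(1). This is a small sketch with a hypothetical function name, `first_return_probability`; it assumes Python 3.8 or later for `math.comb`.

```python
from fractions import Fraction
from math import comb

def first_return_probability(n, p):
    """Probability f_{2n} that a word starting with a has as many a as b
    for the first time at length 2n, with pi(a) = p and pi(b) = 1 - p."""
    q = 1 - p
    return Fraction(comb(2 * n - 2, n - 1), n) * p ** n * q ** n

half = Fraction(1, 2)
partial_sum = sum(first_return_probability(n, half) for n in range(1, 201))
```

For p = q = 1/2 the partial sums increase towards 1/2, in agreement with the value of D(1), but the convergence is slow.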
1.9.2. Extremal problems 


We consider here the problem of computing the average value of several maxima 
concerning words. We assume here that the source is Bernoulli, i.e. that the 
successive letters are drawn independently with a constant probability distribu- 
tion 7. 

We begin with the case of the longest run of successive occurrences of some letter
a with $\pi(a) = p$. The probability of seeing a run of k consecutive a's beginning
at some given position in a word of length n is $p^k$. So the average number
of runs of length k is approximately $np^k$. Let $K_n$ be the average value of the
maximal length of a run of a's in the words of length n.

Intuitively, since the longest run is likely to be unique, we have $np^{K_n} = 1$.
This equation has the solution $K_n = \log_{1/p} n$. One can elaborate the above
intuitive reasoning to prove that

$$\lim_{n \to \infty} \frac{K_n}{\log_{1/p} n} = 1. \qquad (1.9.4)$$

This formula shows that, on average, the maximal length of a run of a's is
logarithmic in the length of the word.

A simple argument shows that the same result holds when runs are extended
to be words over some fixed subset B of the alphabet A. In this case, p is replaced
by the sum of the probabilities of the letters in B.
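
The average $K_n$ can be computed exactly for moderate n and compared with $\log_{1/p} n$. The sketch below is only an illustration: the names `prob_no_run` and `expected_longest_run` are hypothetical, and the method (a dynamic program over the length of the current run) is a standard technique chosen here, not one taken from the text.

```python
import math

def prob_no_run(n, k, p):
    """Probability that a word of length n contains no run of k letters a."""
    # state[r]: probability that no run of length k occurred so far
    # and that the word currently ends with exactly r letters a (r < k)
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(n):
        new_state = [0.0] * k
        new_state[0] = (1 - p) * sum(state)  # next letter is not a
        for r in range(k - 1):
            new_state[r + 1] = p * state[r]  # next letter is a
        state = new_state
    return sum(state)

def expected_longest_run(n, p):
    """K_n, computed as the sum over k of P(longest run >= k)."""
    return sum(1.0 - prob_no_run(n, k, p) for k in range(1, n + 1))

ratio = expected_longest_run(200, 0.5) / math.log(200, 2)
```

The convergence in (1.9.4) is slow, so for n = 200 the value of `ratio` is close to 1 without being equal to it.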

Another application of the above result is the computation of the average
length of the longest common factor starting at the same position in two words
of the same length. Such a factor x induces in two words w and w' the
factorizations w = uxv and w' = u'xv' with |u| = |u'|. Such a factor is just a run of
symbols (a, a) in the word (w, w') written over the alphabet of pairs of letters.
The value of p for Equation (1.9.4) is

$$p = \sum_{a \in A} \pi(a)^2. \qquad (1.9.5)$$


The average length of the longest repeated factor in a word is also logarithmic
in the length of the word. It is easily seen that over a q letter alphabet, the
length k of the longest repeated factor is at least $\lfloor \log_q n \rfloor$ and thus the average
length of the longest repeated factor is at least $\log_q n$. It can be proved that it
is also O(log n).

The longest common factor of two words can be computed in linear time. An
algorithm (LENGTHS-OF-FACTORS) is given in Chapter 2. The average length of
the longest common factor of two words of the same length is also logarithmic
in the length. More precisely, let $C_n$ denote the average length of the longest
common factor of two words of the same length n. Then

$$\lim_{n \to \infty} \frac{C_n}{\log_{1/p} n} = 2.$$

The intuitive argument used to derive Formula (1.9.4) can be adapted to this case
to explain the value of the limit. Indeed, the average number of common
factors of length k in two words of length n is approximately $n^2 p^k$. Solving the
equation $n^2 p^k = 1$ gives $k = \log_{1/p} n^2 = 2 \log_{1/p} n$.

The case of subwords contrasts with the case of factors. We have already
given in Section 1.2.4 an algorithm (LCSLENGTHARRAY(x, y)) which computes
the length of the longest common subwords of two words. The
essential result concerning subwords is that the average length c(k, n) of the
longest common subwords of two words of length n on k symbols is O(n). More
precisely, there is a constant $c_k$ such that

$$\lim_{n \to \infty} \frac{c(k, n)}{n} = c_k.$$

This result is easy to prove, even if the proof does not give a formula for $c_k$.
Indeed, we have $c(k, n + m) \ge c(k, n) + c(k, m)$ since this inequality holds for
the length of the longest common subwords of any pair of words. This implies
that the sequence c(k, n)/n converges (we have already met this argument in
Section 1.8.2). There is no known formula for $c_k$ but only estimates given in
Table 1.4.


Table 1.4. Some upper and lower bounds for cx 


Problems


Section 1.1


1.1.1 Show that the number of words of length n on q letters with a given
subword of length k is

$$\sum_{i=k}^{n} \binom{n}{i} (q-1)^{n-i}.$$

In particular, this number does not depend on the particular word chosen
as a subword. (Hint: Consider the automaton recognizing the set of
words having a given word as subword.)

1.1.2 Let $c : (A \cup \varepsilon) \times (A \cup \varepsilon) \to \mathbb{R} \cup \infty$ be a function assigning a cost to
each pair of elements equal to a symbol or to the empty word. Assume
that
(i) the restriction of c to A × A is a distance.
(ii) $c(\varepsilon, a) = c(a, \varepsilon) > 0$ for all $a \in A$.
Each transformation on a word is assigned a cost using the cost c as
follows. A substitution of a symbol a by a symbol b adds a cost c(a, b).
An insertion of a symbol a counts for $c(\varepsilon, a)$ and a deletion for $c(a, \varepsilon)$.
Let d(u, v) be the distance defined as the minimal cost of a sequence of
transformations that changes u into v.
Show that d is a distance on $A^*$. Show that d coincides with
1. the Hamming distance if c(a, b) = 1 for $a \ne b$ and $c(a, \varepsilon) = c(\varepsilon, a) = \infty$.
2. the subword distance if $c(a, b) = \infty$ for $a, b \in A$ and $a \ne b$, and
$c(a, \varepsilon) = c(\varepsilon, a) = 1$ for all $a \in A$.
Section 1.2


1.2.1 The sharp border array of a word x of length m is the array sb of size
m + 1 such that sb[m] = b[m] and for $1 \le j \le m - 1$, sb[j] is the
largest integer i such that x[0..i − 1] = x[j − i..j − 1] and $x[i] \ne x[j]$.
By convention, sb[j] = −1 if no such integer i exists. For example, if
x = abaababa, the array sb is

j:      0   1   2   3   4   5   6   7   8
sb[j]: -1   0  -1   1   0  -1   3  -1   3

Show that the following variant of Algorithm BORDER computes the
array sb in linear time.


BORDERSHARP(x)
    ▷ x has length m, sb has size m + 1
    i ← 0
    sb[0] ← −1
    for j ← 1 to m − 1 do
        ▷ Here x[0..i − 1] = border(x[0..j − 1])
        if x[j] = x[i] then
            sb[j] ← sb[i]
        else sb[j] ← i
            do i ← sb[i]
            while i ≥ 0 and x[j] ≠ x[i]
        i ← i + 1
    sb[m] ← i
    return sb


Show that, in Algorithm SEARCHFACTOR, one may use the table sb
of sharp borders instead of the table b of borders, resulting in a faster
algorithm.
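
For experimentation, the algorithm can be transcribed as follows. This transcription is a sketch: the name `border_sharp` is hypothetical, the word is assumed to be a nonempty Python string, and the do-while loop of the pseudocode is unrolled into an assignment followed by a while loop.

```python
def border_sharp(x):
    """Sharp border array sb of a nonempty word x, of size len(x) + 1."""
    m = len(x)
    sb = [0] * (m + 1)
    i = 0
    sb[0] = -1
    for j in range(1, m):
        # here x[0:i] is the border of x[0:j]
        if x[j] == x[i]:
            sb[j] = sb[i]
        else:
            sb[j] = i
            i = sb[i]                         # first pass of the do-while
            while i >= 0 and x[j] != x[i]:
                i = sb[i]
        i += 1
    sb[m] = i
    return sb

example = border_sharp("abaababa")
```

The list `example` is the sharp border array of the word abaababa used in the statement of the problem.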


Section 1.3 


1.3.1 This exercise shows how to answer the following questions: what is the 
minimal Hamming distance between a word w and the words of a regular 
set X and how to compute a word of X which realizes the minimum? 
These questions are solved by the following algorithm known as Viterbi 
algorithm. 

Let $\mathcal{A} = (Q, i, T)$ be a finite automaton over the alphabet A and, for
each $p \in Q$, let $X_p$ be the set recognized by the automaton (Q, i, p).
Let $w = a_0 \cdots a_{n-1}$ be a word of length n. For a symbol $a \in A$ and
$0 \le i < n$ we denote c(a, i) = 0 if $a = a_i$ and c(a, i) = 1 otherwise. We
compute a function $d : Q \times \mathbb{N} \to \mathbb{N}$ where d(p, i) is the minimal
Hamming distance of the words in $X_p \cap A^i$ to the word $a_0 \cdots a_{i-1}$.

VITERBI(w)
    for i ← 0 to n − 1 do
        for each edge (p, a, q) do
            if d(p, i − 1) + c(a, i) < d(q, i) then
                d(q, i) ← d(p, i − 1) + c(a, i)
    return $\min_{t \in T} d(t, n - 1)$


Show how to modify this algorithm to return a word in X that is closest
to w.

1.3.2 Prove that the minimal automaton recognizing the set S(w) of suffixes
of a word w of length n has at most 2n states. Hint: show that for any
p, q, the sets $p^{-1}S(w)$ and $q^{-1}S(w)$ are either disjoint or comparable.
Conclude that the states of the automaton can be identified with the
internal nodes of a tree with n leaves corresponding to the elements of
S(w).


Section 1.5


1.5.1 Let $\mathcal{A} = (Q, I, T)$ be a transducer over A, B with n states. Let M be
the maximal length of output labels in the edges of $\mathcal{A}$. Suppose that $\mathcal{A}$
is equivalent to a sequential transducer $\mathcal{B}$, obtained by the determinization
algorithm. Let $(u, q) \in B^* \times Q$ be a pair appearing in a state of $\mathcal{B}$.
Show that $|u| \le 2n^2 M$.


Section 1.7


1.7.1 A rational function is a function of the form $f(z) = \sum_{n \ge 0} a_n z^n$ such
that f(z)q(z) = p(z) for two polynomials p, q with q(0) = 1. It is said
to be nonnegative if $a_n \ge 0$ for all $n \ge 0$. We shall use the fact that if
f(z) is a nonnegative rational function such that $f \ne 0$ and f(0) = 0,
then the radius of convergence $\sigma$ of $f^*(z) = 1/(1 - f(z))$ is a simple pole
of $f^*$ such that $\sigma < |\pi|$ for any other pole $\pi$ (see the Notes Section for
a reference). Moreover $\sigma$ is the unique real number such that $f(\sigma) = 1$.
Let M be an n × n irreducible matrix and let

$$M = \begin{bmatrix} u & v \\ w & N \end{bmatrix}$$

with N of dimension n − 1. Let $f(z) = uz + v(I - zN)^{-1} w z^2$. Show
that
$$\frac{1}{1 - f(z)} = \bigl[(I - Mz)^{-1}\bigr]_{1,1}.$$

Use the result quoted above on nonnegative rational functions to prove
that

1. the spectral radius $\rho_M$ of M is $1/\sigma$ where $\sigma$ is such that $f(\sigma) = 1$.

2. each row of the matrix $[(\sigma - z)(I - Mz)^{-1}]_{z = \sigma}$ is a positive
eigenvector of M corresponding to $1/\sigma$.

(Hint: use the relation $I + (I - Mz)^{-1} Mz = (I - Mz)^{-1}$.)


Section 1.8


1.8.1 Consider a primitive morphism $f : A^* \to A^*$ with a fixpoint $u \in A^\omega$.
We indicate here a method to compute the frequency of the factors of
length $\ell$ in u by a faster method than the one used in Section 1.8.6. Let
$F_\ell$ be the set of factors of length $\ell$ of u. Let $M^{(\ell)}$ be the $F_\ell \times F_\ell$-matrix
defined by $M^{(\ell)}_{x,y} = |f(x)|_y$. Let p be an integer such that $|f^p(a)| > \ell - 2$ for
all $a \in A$. Let U be the $F_2 \times F_\ell$-matrix defined as follows. For $a, b \in A$
such that $ab \in F_2$ and $y \in F_\ell$, $U_{ab,y}$ is the number of occurrences of y
in $f^p(ab)$ that begin in the prefix $f^p(a)$. Show that

$$U M^{(\ell)} = M^{(2)} U,$$

that $M^{(2)}$ and $M^{(\ell)}$ have the same dominant eigenvalue $\rho$ and that if
$v_2$ is an eigenvector of $M^{(2)}$ corresponding to $\rho$, then $v_\ell = v_2 U$ is an
eigenvector of $M^{(\ell)}$ corresponding to $\rho$.

1.8.2 Let $\mu : a \to ab, b \to ba$ be the morphism with fixpoint the Thue-Morse
word. Show that for $\ell = 5$, p = 3, the matrix U of the previous
problem (with the 12 factors of length 5 of the Thue-Morse word listed
in alphabetic order) is

1 0 1 1 0 1 1 0 1 1 0 1
0 1 1 0 1 1 0 1 1 0 1 1
1 1 0 1 1 0 1 1 0 1 1 0
1 0 1 1 0 1 1 0 1 1 0 1

and that the vector $v_2 U$ with $v_2 = [1\ 2\ 2\ 1]$ is the vector with all
components equal to 4. Deduce that the 12 factors of length 5 of the
Thue-Morse word have the same frequency (see Example 1.8.4).

1.8.3 Consider the following transformation T on words: a word w is replaced
by the word T(w) of the same length obtained as follows: list
the cyclic shifts of w in alphabetic order as the rows $w_1, w_2, \ldots, w_n$ of
an array. Then T(w) is the last column of the array. For example, let
w = abracadabra. The list of conjugates of w sorted in alphabetical
order is represented below.

     1  2  3  4  5  6  7  8  9 10 11
 1   a  a  b  r  a  c  a  d  a  b  r
 2   a  b  r  a  a  b  r  a  c  a  d
 3   a  b  r  a  c  a  d  a  b  r  a
 4   a  c  a  d  a  b  r  a  a  b  r
 5   a  d  a  b  r  a  a  b  r  a  c
 6   b  r  a  a  b  r  a  c  a  d  a
 7   b  r  a  c  a  d  a  b  r  a  a
 8   c  a  d  a  b  r  a  a  b  r  a
 9   d  a  b  r  a  a  b  r  a  c  a
10   r  a  a  b  r  a  c  a  d  a  b
11   r  a  c  a  d  a  b  r  a  a  b

The word T(w) is the last column of the array. Thus in our example
T(w) = rdarcaaaabb. Show that w ↦ T(w) is a bijection.
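
The transformation T itself is short to program, which makes it easy to experiment before attempting the proof. The sketch below uses a hypothetical name, `cyclic_shift_transform`, and the naive method of sorting all cyclic shifts, which is not meant to be efficient.

```python
def cyclic_shift_transform(w):
    """Last column of the array of cyclic shifts of w in alphabetic order."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rotation[-1] for rotation in rotations)

transformed = cyclic_shift_transform("abracadabra")
```

The word `transformed` can be compared with the last column of the array above.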

1.8.4 Let $u_S(z)$ be the generating series of the number of words of length n
in S, that is

$$u_S(z) = \sum_{n \ge 0} \mathrm{Card}(S \cap A^n)\, z^n.$$

Show that
$$u_S(z) = S(qz),$$

where S(z) is the generating series for the uniform Bernoulli distribution
on q symbols.
1.8.5 Show that the generating series of the set F of words over {a, b} without
factor $w = a^n$ is
$$F(z) = \frac{1 - z^n}{1 - 2z + z^{n+1}}.$$
Notes 


Several textbooks treat the subject of algorithms on words in much more detail 
than we did here. In the first place, several general textbooks on algorithms like 
Aho, Hopcroft, and Ullman (1975) or Sedgewick (1983) include automata and 
pattern matching algorithms among many other topics. In the second place, 
several books like Crochemore and Rytter (1994), Gusfield (1997) or Baeza-
Yates and Ribeiro-Neto (1999) are entirely dedicated to word algorithms.


Words. The distance introduced in Problem 1.1.2 is known as the edit distance
or the alignment distance. It was first introduced by Levenshtein
(1965) and is used in many ways in bioinformatics (see Sankoff and Kruskal
1983).

The algorithm VITERBI in Problem 1.3.1 is used in the context of convolutional
error-correcting codes (see McEliece 2002). It appears again in Chapter 4.


Elementary algorithms. The algorithm BORDER, which computes the border of a
word in linear time, and the linear-time algorithm SEARCHFACTOR, which checks
whether a word is a factor of another, are originally due to Knuth, Morris,
and Pratt (1977). This algorithm is the first of a large family of algorithms
constituting the field of pattern matching algorithms.
See Crochemore, Hancart, and Lecroq (2001) for a general presentation. The
algorithm BORDERSHARP of Problem 1.2.1 is from Knuth et al. (1977).

The quadratic algorithm to compute a longest common subword (Algorithm
LCS) is usually credited to Hirschberg (1977), although many authors
discovered it independently (see Sankoff and Kruskal 1983). It is not known
whether a linear algorithm exists. An algorithm working in time
O(p log n) on two words of length n with p pairs of matching positions is due to
Hunt and Szymanski (1977).

The linear algorithm CIRCULARMIN that computes the least conjugate of a
word is due to Booth (1980). Several refinements were proposed, see Shiloach
(1981), Duval (1983), Apostolico and Crochemore (1991). The algorithm giving
the factorization in Lyndon words (Algorithm LYNDONFACTORIZATION) is due
to Fredricksen and Maiorana (1978), see also Duval (1983).


Tries and automata. Tries are treated in many textbooks on algorithms (e.g. 
Aho, Hopcroft, and Ullman 1983). Our treatment of the implementation of 
automata and pattern matching is also similar to that of most textbooks (see 
for example Aho et al. 1975). 

The exact complexity of the minimization problem for deterministic finite
automata is not yet known. Moore’s algorithm appears in a historical paper
(Moore 1956). Hopcroft’s minimization algorithm appears first in Hopcroft
(1971). The linear minimization algorithm for DAWGs is from Revuz (1992).
It can be considered as an extension of the tree isomorphism algorithm in Aho
et al. (1975).

Gilbreath’s card trick (Example 1.3.9) is described as follows in Chapter 9 
of Gardner (1966): Consider a deck of 2n cards ordered in such a way that red 
and black cards alternate. Cut the deck into two parts and give it a riffle shuffle. 
Cut it once more, this time not completely arbitrarily but at a place where two 
cards of the same colour meet. Square up the deck. 

Then for every i = 1, ..., n the pair consisting of the (2i − 1)-th and the
2i-th card is of the form (red, black) or (black, red). The property of binary
sequences underlying the card trick is slightly less general than Formula 1.3.1.

The source of Exercise 1.3.2 is Blumer, Blumer, Haussler, Ehrenfeucht, Chen, 
and Seiferas (1985). The automaton can be used in several contexts, including 
as a transducer called the suffix transducer (see Chapter 2). 


Pattern Matching. The equivalence of regular expressions and finite automata
is a classical result known as Kleene’s theorem. We present here only one
direction of this result, namely the construction of finite automata from regular
expressions. This transformation is used in many situations. Actually, regular
expressions are often used as a specification of some pattern and the equivalent
finite automaton can be considered as an implementation of this specification.
The converse transformation gives rise to algorithms that are less frequently
used. One use is the computation of generating series (see Section
1.9).

There is basically only one method to transform a regular expression into
a finite automaton; it operates by induction on the structure of the regular
expression. However, several variants exist. The one presented here is due to
Thompson (1968). It uses ε-transitions and produces a normalized
automaton that has a unique initial state with no edge entering it and a unique
terminal state with no edge going out of it. Another variant produces an
automaton without ε-transitions (see Eilenberg 1974 for example). The resulting
automaton has in general fewer states than the one obtained by Thompson’s
algorithm. Yet another variant directly produces a deterministic automaton (see
Berry and Sethi 1986 or Aho, Sethi, and Ullman 1986).


Transducers. The notion of a rational relation and of a transducer dates back
to the origins of automata theory although there are few books treating this
aspect of the theory extensively. Eilenberg’s book (Eilenberg 1974) represents a
significant date in the clarification of the concepts and notation. Later books
treating transducers include Berstel (1979) and Sakarovitch (2004). A word on
the terminology concerning what we call here sequential transducers. The term
“sequential machine” (also called “Mealy machine”) is in general used only in
the case of sequential letter-to-letter transducers. The version using (possibly
empty) word outputs is often called a “generalized sequential machine” (or gsm).
A further generalization, used by Schützenberger (1977), introduces a class of
transducers called subsequential which allow the additional use of a terminal
suffix. We simply call here sequential these subsequential transducers.

Any sequential function is a rational function but the converse is of course
not true. There are several characterizations of rational functions; in particular,
special classes of transducers realize rational functions. Among these are the
so-called bimachines, used in Chapter 3. The determinization algorithm was
first studied in Schützenberger (1977) and Choffrut (1979). In particular, the
characterization of sequential transducers by the twinning property is in
Choffrut (1977, 1979). See also Reutenauer (1990).

The source of Problem 1.5.1 is Béal and Carton (2002). It can be checked in
polynomial time whether a transducer is equivalent to a sequential one (see
Weber and Klemm 1995 or Béal and Carton 2002). The normalization algorithm
was first considered by Choffrut (1979) and subsequently by Mohri (1994)
and by Béal and Carton (2001).

A quite different algorithm relying on shortest paths algorithms has been 
proposed by Breslauer (1998). For recent developments, see the survey on min- 
imization algorithms of transducers in Choffrut (2003). 


Parsing. The section on parsing follows essentially Aho et al. (1986). The Dyck
language is named after the group theorist Walther von Dyck (Dyck 1882).
Context-free grammars are an important model for modelling hierarchically
structured data. They appear in various equivalent forms, such as recursive
transition networks (RTN) used in natural language processing (see Chapter 3).
Parsers are ubiquitous in data processing systems, including natural language
processing. The abstract model of a parser is a pushdown automaton, which is
a particular case of the general model of Turing machine.

It is a remarkable fact that a large class of grammars can be parsed in one 
pass, from left to right, and in linear time. This was first established by Knuth 
(1965) who introduced in particular the LR analysis described here. 


Word enumeration. A detailed proof of the Perron–Frobenius theorem can be
found in Gantmacher (1959). The proof given here is due to Wielandt, whence
the name of “Wielandt function”, also called Collatz–Wielandt function (see
Allouche and Shallit 2003).

Problem 1.7.1 presents the connection between the theorem of Perron–Frobenius
and related statements concerning the poles of nonnegative rational functions.
For a proof of the statement appearing at the beginning of the problem,
see Eilenberg (1974) or Berstel and Reutenauer (1984). This approach gives a
proof of the Perron–Frobenius theorem that differs from the one given in
Section 1.7.2. See McCluer (2000) for a survey on several possible proofs of this
theorem.


Probability. Our presentation of probability distributions on words is inspired
by Welsh (1988), Szpankowski (2001) and Shields (1969). A proof of the
fundamental theorem of Markov chains can be found in most textbooks on probability
theory (see e.g. Feller 1968, chap. XV). The theorem of Kolmogorov on
probability measures on infinite words can be found in Feller (1971, chap. IV). The
notion of entropy is due to Shannon (1948). Many textbooks contain a
presentation of the main properties of entropy (see e.g. Ash 1990). The computation of
the distribution of maximal entropy (Section 1.8.4) is originally due to Shannon.
Our presentation follows Lind and Marcus (1996, chap. 13).

The computation of the frequencies of factors in fixpoints of substitutions is 
reproduced from Queffélec (1987). The method described in Problem 1.8.1 is 
also from Queffélec (1987). 

The asymptotic equirepartition property of ergodic sources is known as the
Shannon–McMillan theorem. See Shields (1969) for a proof. The Ziv-Lempel
coding originally appears in Ziv and Lempel (1977). A complete presentation
of this popular coding can be found in Bell, Cleary, and Witten (1990) with
several variants.

The entropy of English has been studied by Shannon. In particular, Tables
1.2 and 1.3 are from Shannon (1951). They are reproduced in several manuals
on text compression (see e.g. Welsh 1988 or Bell et al. 1990).


Statistics on words. Events defined by prefix codes (Section 1.9.1) are presented
in Feller (1968) under the name of recurrent events. Formula (1.9.1) appears
already in the paper Schützenberger (1964). The name of autocorrelation
polynomial appears in Guibas and Odlyzko (1981b). Formula (1.9.4) is due to Erdős
and Rényi (1970). For a proof, see Waterman (1995, chap. 11). Formula (1.9.5)
is due to Arratia, Morris, and Waterman (1988) (see Waterman 1995, chap. 11).
Table 1.4 is from Sankoff and Kruskal (1983).

The transformation described in Problem 1.8.3 is known as the Burrows–Wheeler
transformation (see Manzini 2001). It is the basis of a text
compression method. Indeed, the idea is that adjacent rows of the table of cyclic
shifts will often begin with a long common prefix and T(w) will therefore have
long runs of identical symbols. For example, in a text in English, most rows
beginning with ‘nd’ will end with ‘a’.


2.0. Introduction


The chapter presents data structures used to store the suffixes of a text and
some of their applications. These structures are designed to give fast access
to all factors of the text, and this is the reason why they have a fairly large
number of applications in text processing.

Two types of objects are considered in this chapter, digital trees and au- 
tomata, together with their compact versions. Trees put together common 
prefixes of the words in the set. Automata gather in addition their common 
suffixes. The structures are presented in order of decreasing size. 

The representation of all the suffixes of a word by an ordinary digital tree
called a suffix trie (Section 2.1) has the advantage of being simple but can lead
to a memory size that is quadratic in the length of the considered word. The
compact tree of suffixes (Section 2.2) is guaranteed to fit in linear memory space.

The minimization (in the sense of automata) of the suffix trie gives the minimal
automaton accepting the suffixes and is described in Section 2.4. Compaction
and minimization together yield the compact suffix automaton of Section 2.5.

Most algorithms that build the structures presented in the chapter work in
time O(n × log Card A), for a text of length n, assuming that there is an ordering
on the alphabet A. Their execution time is thus linear when the alphabet is finite
and fixed. Locating a word of length m in the text then takes O(m × log Card A)
time.

The main application of the presented techniques is to provide the basis for
implementing indexes, which is described in Section 2.6. But the direct access
to factors of a word permits a great number of other applications. We briefly
mention how to detect repetitions or forbidden words in a text (Section 2.7).
The structures can also be used to search for fixed patterns in texts because they
can be regarded as pattern matching machines (see Section 2.8). This method is
extended in a particularly effective way for searching conjugates (or rotations)
of a pattern in Section 2.8.3.


2.1. Suffix trie 


The tree of suffixes of a word, called its suffix trie, is a deterministic automaton
that accepts the suffixes of the word and in which there is a unique path from
the initial state to any state. It can be viewed as a digital tree which represents
the set of suffixes of the word. Standard methods can be used to implement
such an automaton, but its tree structure permits a simplified representation.

Considering a tree implies that the terminal states of the tree are in one-to- 
one correspondence with the words of the accepted language. The tree is thus 
finite only if its language is also finite. Consequently, the explicit representation 
of the tree has an algorithmic interest only for finite languages. 

Sometimes one forces the tries to have for terminal states only the external
nodes of the tree. With this constraint, a language L is representable by a trie
only if no proper prefix of a word of L is in L. It results from this remark that
if y is a nonempty word, only Suff(y) \ {ε} is representable by a trie having this
property, and this takes place only when the last letter of y appears only once
in y. For this reason one frequently adds a marker at the end
of the word. We prefer to attach an output to the nodes of the tree, which is
in conformity with the concept used, that of automaton. Only the nodes whose
output is defined are regarded as terminals. In addition, there are only very
slight differences between the implementations of the two features.

Figure 2.1. Trie T(ababbb) of suffixes of ababbb. Terminal states are
marked by double circles. The output associated with a terminal state is
the position of the corresponding suffix on the word ababbb. The empty
suffix, by convention, is associated with the length of the word.

The suffix trie of a word y is denoted by T(y). Its nodes are the factors of y,
ε is the initial state, and the suffixes of y are the terminal states. The transition
function δ of T(y) is defined by δ(u, a) = ua if ua is a factor of y and a ∈ A.
The output of a terminal state, which is then a suffix, is the position of this
suffix in y. An example of suffix trie is displayed in Figure 2.1.

A classical construction of T(y) is carried out by adding successive suffixes
of y in the tree under construction, from the longest suffix, y itself, until the
shortest, the empty word.

The current operation consists in inserting y[i..n − 1], the suffix at position
i, in the structure which already contains all the longer suffixes. It is illustrated
by Figure 2.2. We call head of a suffix its longest prefix common to a suffix
occurring at a smaller position. It is also the longest prefix of y[i..n − 1] that
is the label of some path starting at the initial state of the automaton under
construction. The target state of the path is called a fork (two divergent paths start
from this state). If y[i..k − 1] is the head of the suffix at position i (y[i..n − 1]),
the word y[k..n − 1] is called the tail of the suffix.

More precisely, one calls fork any state of the automaton which is of
(out)degree at least 2, or which is both of degree 1 and terminal.

Algorithm SUFFIXTRIE builds the suffix trie of y. Its code is given below. It
is supposed that the automaton is represented by lists of successors (adjacency
lists). The list associated with state p is denoted by adj[p] and contains pairs
of the form (a, q) where a is a letter and q a state. The function TARGET
implements transitions of the automaton, so TARGET(p, a) is q when (a, q) ∈
adj[p] (or more generally when (au, q) ∈ adj[p] for some word u, as considered
in next sections). States of the automaton have the attribute output whose
value is a position. When creating a state, the procedure NEWSTATE allocates
an empty adjacency list and sets the value of the attribute output as undefined.
Only the output of terminal states is set by the algorithm. The procedure
NEWAUTOMATON creates a new automaton, say M, with only one state, its
initial state initial(M).

Figure 2.2. The trie T(ababbb) (see Figure 2.1) during its construction,
just after the insertion of suffix abbb. The fork, state 2, corresponds to
the head, ab, of the suffix. It is the longest prefix of abbb that appears
before the concerned position. The tail of the suffix is bb, label of the
path grafted at this stage from the fork and leading to states 12 and 13.

In the algorithm, the insertion of the current suffix y[i..n − 1] in the automaton
M under construction starts with the computation of its head, y[i..k − 1],
and of the associated fork, p = δ(initial(M), y[i..k − 1]), from which is grafted
the tail of the suffix (denoting by δ the transition function of M). The value of
the function SLOWFIND applied to the pair (initial(M), i) is precisely the sought
pair (p, k). The creation of the path of label y[k..n − 1] from p together with
the definition of the output of its target is carried out at lines 5-9.

The last step of the execution, insertion of the empty suffix, just consists
in defining the output of the initial state, whose value is n = |y| by convention
(line 10).

SUFFIXTRIE(y, n)
 1  M ← NEWAUTOMATON()
 2  for i ← 0 to n − 1 do
 3      (fork, k) ← SLOWFIND(initial(M), i)
 4      p ← fork
 5      for j ← k to n − 1 do
 6          q ← NEWSTATE()
 7          adj[p] ← adj[p] ∪ {(y[j], q)}
 8          p ← q
 9      output[p] ← i
10  output[initial(M)] ← n
11  return M


SLOWFIND(p, i)
 1  for k ← i to n − 1 do
 2      if TARGET(p, y[k]) is undefined then
 3          return (p, k)
 4      p ← TARGET(p, y[k])
 5  return (p, n)


PROPOSITION 2.1.1. Algorithm SUFFIXTRIE builds the suffix trie of a word of
length n in time O(n²).


Proof. The correctness is easy to check on the code of the algorithm.

For the evaluation of execution time, let us consider stage i. Let us suppose
that y[i..n − 1] has head y[i..k − 1] and has tail y[k..n − 1]. The call to
SLOWFIND (line 3) performs k − i operations and the for loop at lines 5-8 does
n − k ones, which gives a total of n − i operations. Thus the for loop indexed
by i at lines 2-9 executes n + (n − 1) + ··· + 1 operations, which gives a total
execution time O(n²). ∎
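To make the construction concrete, here is a minimal Python transcription of SUFFIXTRIE and SLOWFIND. It is a sketch under our own representation choices: states are integers numbered in creation order with 0 as the initial state, adj[p] is a dictionary from letters to states, and an undefined output is represented by None. The helper `state_words` is not part of the algorithm and is only used to inspect the result.

```python
def suffix_trie(y):
    """Suffix trie of y, built by inserting suffixes from longest to shortest."""
    n = len(y)
    adj, output = [{}], [None]          # state 0 is the initial state

    def slow_find(p, i):
        """Follow y[i:] from p as far as possible; return (fork, k)."""
        for k in range(i, n):
            if y[k] not in adj[p]:
                return p, k
            p = adj[p][y[k]]
        return p, n

    for i in range(n):
        p, k = slow_find(0, i)
        for j in range(k, n):           # graft the tail y[k:] below the fork
            adj.append({})
            output.append(None)
            q = len(adj) - 1
            adj[p][y[j]] = q
            p = q
        output[p] = i
    output[0] = n                       # the empty suffix
    return adj, output


def state_words(adj):
    """Map each state to the word spelled from the initial state."""
    words, stack = {0: ""}, [0]
    while stack:
        p = stack.pop()
        for a, q in adj[p].items():
            words[q] = words[p] + a
            stack.append(q)
    return words


adj, output = suffix_trie("ababbb")
words = state_words(adj)
print(len(adj), "states")
print(sorted((output[q], words[q]) for q in words if output[q] is not None))
```

With this numbering the insertion of abbb into the trie of ababbb stops at state 2 (the word ab) and creates states 12 and 13, in agreement with the caption of Figure 2.2.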


2.1.1. Suffix links 


It is possible to accelerate the preceding construction by improving the search
for forks. The technique described here is used in the following section where
it yields a gain in execution time that shows in the asymptotic evaluation.

Let av be a suffix of y with a nonempty head az (a ∈ A). The prefix z of
v thus appears in y before the considered occurrence. This implies that z is a
prefix of the head of suffix v. The search for this head and the corresponding
fork can thus be done by starting in state z instead of starting systematically
with the initial state as done in the preceding algorithm. However, this supposes
that, the state az being known, one has a fast access to state z. For that, one
introduces a function defined on the states of the automaton and called the
suffix link function. It is denoted by s_y and defined, for each state az (a ∈ A,
z ∈ A*), by s_y(az) = z. State z is called the suffix link of state az. Figure 2.3
displays in dashed arrows the suffix link function of the trie of Figure 2.1.

Figure 2.3. The trie T(ababbb) with suffix links of forks and of their
ancestors indicated by dashed arrows.

The algorithm SLOWFIND-BIS uses the suffix link function for the
computation of the suffix trie of y. The function is implemented by a table named
sℓ. Suffix links are actually computed there only for the forks and their
ancestors, except for the initial state. The rest is just an adaptation of algorithm
SLOWFIND that includes the definition of the suffix link table sℓ. The new
algorithm is called SLOWFIND-BIS.


SLOWFIND-BIS(p, k)
 1  while k < n and TARGET(p, y[k]) is defined do
 2      q ← TARGET(p, y[k])
 3      (e, f) ← (p, q)
 4      while e ≠ initial(M) and sℓ[f] is undefined do
 5          sℓ[f] ← TARGET(sℓ[e], y[k])
 6          (e, f) ← (sℓ[e], sℓ[f])
 7      if sℓ[f] is undefined then
 8          sℓ[f] ← initial(M)
 9      (p, k) ← (q, k + 1)
10  return (p, k)


Algorithm SUFFIXTRIE-BIS is an adaptation of SUFFIXTRIE. It uses the
function SLOWFIND-BIS instead of SLOWFIND.

SUFFIXTRIE-BIS(y, n)
 1  M ← NEWAUTOMATON()
 2  sℓ[initial(M)] ← initial(M)
 3  (fork, k) ← (initial(M), 0)
 4  for i ← 0 to n − 1 do
 5      k ← max{k, i}
 6      (fork, k) ← SLOWFIND-BIS(sℓ[fork], k)
 7      p ← fork
 8      for j ← k to n − 1 do
 9          q ← NEWSTATE()
10          adj[p] ← adj[p] ∪ {(y[j], q)}
11          p ← q
12      output[p] ← i
13  output[initial(M)] ← n
14  return M


PROPOSITION 2.1.2. Algorithm SUFFIXTRIE-BIS builds the suffix trie of y in
time O(Card Q), where Q is the set of states of T(y).


Proof. The operations of the main loop, apart from line 6 and the for loop at
lines 8-11, are carried out in constant time, which gives a time O(n) for their
total execution.

Each operation of the internal loop of the algorithm SLOWFIND-BIS, which
is called at line 6, creates a suffix link. The total number of links
being bounded by Card Q, the cumulated time of all the executions of line 6 is
O(Card Q).

The execution time of the loop 8-11 is proportional to the number of states
that it creates. The cumulated time of all the executions of lines 8-11 is thus
still O(Card Q).

Consequently, the total time of the construction is O(Card Q), which is the
announced result. ∎
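The construction with suffix links can be sketched in Python as follows. This is our own transcription of SUFFIXTRIE-BIS and SLOWFIND-BIS, not a reference implementation: states are integers with 0 as the initial state, the table `sl` plays the role of the suffix link table, None stands for an undefined link or output, and `state_words` is a hypothetical helper used only to inspect the result.

```python
def suffix_trie_bis(y):
    """Suffix trie of y built with suffix links; returns (adj, output, sl)."""
    n = len(y)
    adj, output, sl = [{}], [None], [0]     # the link of the initial state is itself

    def slow_find_bis(p, k):
        while k < n and y[k] in adj[p]:
            q = adj[p][y[k]]
            e, f = p, q
            # Define the links along the chain of suffixes of q that lack one.
            while e != 0 and sl[f] is None:
                sl[f] = adj[sl[e]][y[k]]
                e, f = sl[e], sl[f]
            if sl[f] is None:
                sl[f] = 0
            p, k = q, k + 1
        return p, k

    fork, k = 0, 0
    for i in range(n):
        k = max(k, i)
        fork, k = slow_find_bis(sl[fork], k)   # start from the link, not from the root
        p = fork
        for j in range(k, n):
            adj.append({})
            output.append(None)
            sl.append(None)
            q = len(adj) - 1
            adj[p][y[j]] = q
            p = q
        output[p] = i
    output[0] = n
    return adj, output, sl


def state_words(adj):
    """Map each state to the word spelled from the initial state."""
    words, stack = {0: ""}, [0]
    while stack:
        p = stack.pop()
        for a, q in adj[p].items():
            words[q] = words[p] + a
            stack.append(q)
    return words


adj, output, sl = suffix_trie_bis("ababbb")
words = state_words(adj)
print({words[q]: words[sl[q]] for q in words if q != 0 and sl[q] is not None})
```

The printed dictionary lists the states whose link has been defined, each paired with its suffix link; as stated above, links are only computed for the forks and their ancestors.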


The size of T(y) can be quadratic. This is the case for example for a 
word whose letters are pairwise distinct. For this category of words algorithm 
SUFFIXTRIE-BIS is in fact not faster than SUFFIXTRIE. 

For certain words, it is enough to prune the hanging branches (below the 
forks) of T(y) to obtain a structure of linear size. This kind of pruning gives the 
tree called the position tree of y (see Figure 2.4), which represents the shortest 
factors occurring only once in y or the suffixes that identify other positions. 
However, considering the position tree does not completely solve the question
of memory space, since the structure can still have a quadratic size. It can be
checked for example that the word a^k b^k a^k b^k (k ∈ N) of length 4k has a pruned
suffix trie that contains more than k² nodes.

The compact tree structure of the following section is one way to obtain
a structure of linear size. The automata of Sections 2.4 and 2.5 provide another
type of solution.

Figure 2.4. Position tree of ababbb. It accepts the shortest factors which
identify positions on the word, and some suffixes.


Figure 2.5. The (compact) suffix tree S(ababbb) with its suffix links.


2.2. Suffix tree 


The compact suffix trie of word y, simply called its suffix tree and denoted by
S(y), is obtained by removing the nodes of degree 1 which are not terminal in
its suffix trie. This operation is called the compaction of the trie. The compact
tree preserves only the forks and the terminal nodes of the suffix trie. Labels of
edges then become words of positive variable length. Observe that if two edges
start from the same node and are labeled by the words u and v then the first
letters of these words are distinct, i.e., u[0] ≠ v[0]. This comes from the fact
that the suffix trie is a deterministic automaton.

Figure 2.5 shows the compact suffix tree obtained by compaction of the suffix 
trie of Figure 2.1. 


Version June 23, 2004 




Figure 2.6. Representation of labels in the (compact) suffix tree
S(ababbb). (To be compared with the tree in Figure 2.5.) Label (2, 4)
of edge (3, 1) represents the factor of length 4 at position 2 in y, i.e., the
word abbb.


PROPOSITION 2.2.1. The compact suffix tree of a word of length n > 0 has 
between n+ 1 and 2n nodes. The number of forks of the tree is between 1 and 
n. 


Proof. The tree contains n + 1 distinct terminal nodes corresponding to the 
n+ 1 suffixes they represent. This gives the lower bound. 

Each fork that is not terminal has at least two children. For a fixed number
of external nodes, the maximum number of these forks is obtained when each
one has exactly two children. In this case, one obtains at most n forks (terminal
or not). As for n > 0 the initial state is both a fork and a terminal node, one
obtains the bound (n + 1) + n − 1 = 2n on the total number of nodes. ∎


The fact that the compact suffix tree has a linear number of nodes does 
not imply the linearity of its representation, because this also depends on the 
total size of labels of the edges. The example of a word of length n that has n 
distinct letters shows that this size can well be quadratic. Nevertheless, labels 
of edges being all factors of y, each one can be represented by a pair position- 
length (or also starting position-end position), provided that the word y resides 
in memory with the tree to allow access to the labels. A word u that is the
label of an edge (p, q) is represented by the pair (i, |u|) where i is the position
of some occurrence of u in y. We write label(p, q) = (i, |u|) and assume that
the implementation of the tree provides a direct access to this label. This 
representation of labels is illustrated in Figure 2.6 for the tree of Figure 2.5. 
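The effect of compaction and of the position-length encoding can be observed on small words with the following Python sketch. It is deliberately naive (it enumerates all factors, so it is not a linear-space construction) and it identifies each node with the word it spells; the function name and the representation are our own assumptions.

```python
def compact_suffix_tree(y):
    """Nodes and labelled edges of the compacted suffix trie of y (naive sketch)."""
    n = len(y)
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    suffixes = {y[i:] for i in range(n + 1)}
    alphabet = set(y)

    def degree(u):
        return sum(u + a in factors for a in alphabet)

    # Compaction keeps the terminal states and the states of degree at least 2.
    nodes = {u for u in factors if u in suffixes or degree(u) >= 2}
    edges = {}
    for q in nodes - {""}:
        # The parent of q is its longest proper prefix that is still a node.
        p = max((q[:m] for m in range(len(q)) if q[:m] in nodes), key=len)
        # The label q[len(p):] is stored as (position, length) of an occurrence in y.
        edges[(p, q)] = (y.find(q) + len(p), len(q) - len(p))
    return nodes, edges


y = "ababbb"
nodes, edges = compact_suffix_tree(y)
for (p, q), (i, length) in sorted(edges.items()):
    print(repr(p), "->", repr(q), (i, length), y[i:i + length])
```

Each edge costs two integers whatever the length of its label, which is the point of the representation: for ababbb the edge from ab to ababbb carries the pair (2, 4) instead of the word abbb.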


PROPOSITION 2.2.2. Representing labels of edges by pairs of integers, the total
size of the compact suffix tree of a word is linear in its length, i.e., the size of
S(y) is O(|y|).


Proof. The number of nodes of S(y) is O(|y|) according to Proposition 2.2.1.
The number of edges of S(y) is one unit less than the number of nodes. The
assumption on the representation of labels of edges implies that each edge
occupies a constant space. This gives the result. ∎


The suffix link function introduced in the preceding section finds its
full usefulness in the construction of compact suffix tries. It allows a fast
construction when, moreover, the algorithm SLOWFIND of the preceding section is
replaced by the algorithm FASTFIND hereafter that has a similar function. The
possibility of retaining only the forks of the tree, in addition to terminal states,
rests on the following proposition.


PROPOSITION 2.2.3. In a suffix trie, the suffix link of a nonempty fork is a
fork.


Proof. For a nonempty fork, there are two cases to consider according to whether
the fork, say au (a ∈ A, u ∈ A*), has degree at least 2, or has degree 1 and is
terminal.

Let us suppose first that the degree of au is at least 2. For two distinct
letters b and c, the words aub and auc are factors of y. The same property then
holds for u = s_y(au), which is thus of degree at least 2 and therefore is a fork.

If the fork au has degree 1 and is terminal, then aub is a factor of y for some
letter b and simultaneously au is a suffix of y. Thus, ub is a factor of y and u is
a suffix of y, which shows that u = s_y(au) is also a fork. ∎


The following property is used as a basis for the computation of suffix links
in the algorithm SUFFIXTREE that builds the suffix tree. We denote by δ the
transition function of S(y).


LEMMA 2.2.4. Let (p, q) be an edge of S(y) and y[j..k − 1], j < k, be its
label. If q is a fork of the tree, then

$$s_y(q) = \begin{cases}
\delta(\varepsilon,\, y[j+1..k-1]) & \text{if } p \text{ is the initial state,}\\
\delta(s_y(p),\, y[j..k-1]) & \text{otherwise.}
\end{cases}$$


Proof. As q is a fork, s_y(q) is defined according to Proposition 2.2.3. If p is the
initial state of the tree, i.e., if p = ε, one has s_y(q) = δ(ε, y[j + 1..k − 1]) by
definition of s_y.

In the other case, there is a single path from the initial state ending at p
because S(y) is a tree. Let av be the nonempty label of this path with a ∈ A and
v ∈ A* (i.e., p = av). One has δ(ε, v) = s_y(p) and δ(ε, v·y[j..k − 1]) = s_y(q).
It follows that s_y(q) = δ(s_y(p), y[j..k − 1]) (the automaton is deterministic), as
announced. ∎


2.2.1. Construction 


The strategy we select to build the suffix tree of y consists in successively
inserting the suffixes of y in the structure, from the longest to the shortest, as in
the preceding section. As for the algorithm SUFFIXTRIE-BIS, the insertion of
the tail of the running suffix is done after a slow find starting from the suffix
link of the current fork. When this link does not exist it is created (lines 6-11
of SUFFIXTREE) by using the equality of the preceding statement. Calculation
is performed by the algorithm FASTFIND that satisfies

$$\mathrm{FASTFIND}(r, j, k) = \delta(r,\, y[j..k-1])$$

for r state of the tree and j, k positions on y for which
r·y[j..k − 1] is a factor of y.

Figure 2.7. Schema for the insertion of the suffix y[i..n − 1] = u·v·w·z
of y in the (compact) suffix tree during its construction, when the suffix
link is not defined on the fork a·u·v. Let t be the parent of this fork and
v be the label of the associated edge. One first computes p = δ(s_y(t), v)
using FASTFIND, then the fork of the suffix using SLOWFIND as in Section
2.1.


The diagram for the insertion of one suffix inside the tree under construction is
presented in Figure 2.7.

SUFFIXTREE(y, n)
 1  M ← NEWAUTOMATON()
 2  sℓ[initial(M)] ← initial(M)
 3  (fork, k) ← (initial(M), 0)
 4  for i ← 0 to n − 1 do
 5      k ← max{i, k}
 6      if sℓ[fork] is undefined then
 7          t ← parent of fork
 8          (j, ℓ) ← label(t, fork)
 9          if t = initial(M) then
10              ℓ ← ℓ − 1
11          sℓ[fork] ← FASTFIND(sℓ[t], k − ℓ, k)
12      (fork, k) ← SLOWFINDC(sℓ[fork], k)
13      if k < n then
14          q ← NEWSTATE()
15          adj[fork] ← adj[fork] ∪ {((k, n − k), q)}
16      else q ← fork
17      output[q] ← i
18  output[initial(M)] ← n
19  return M


Algorithm SLOWFINDC is merely adapted from algorithm SLOWFIND to
take into account the fact that labels of edges are words. However, when the
sought target falls in the middle of an edge it is now necessary to cut this edge.
Let us notice that TARGET(p, a), if it exists, is the state q for which a is the
first letter of the label of the edge (p, q). Labels can be words of length strictly
more than 1; thus, it is not true in general that TARGET(p, a) = δ(p, a).


SLOWFINDC(p, k)
 1  while k < n and TARGET(p, y[k]) is defined do
 2      q ← TARGET(p, y[k])
 3      (j, ℓ) ← label(p, q)
 4      i ← j
 5      do  i ← i + 1
 6          k ← k + 1
 7      while i < j + ℓ and k < n and y[i] = y[k]
 8      if i < j + ℓ then
 9          adj[p] ← adj[p] \ {((j, ℓ), q)}
10          r ← NEWSTATE()
11          adj[p] ← adj[p] ∪ {((j, i − j), r)}
12          adj[r] ← adj[r] ∪ {((i, ℓ − i + j), q)}
13          return (r, k)
14      p ← q
15  return (p, k)


The improvement on the execution time of the construction of a suffix tree
by the algorithm SUFFIXTREE rests, in addition to the compaction of the
data structure, on an additional algorithmic element: the implementation of
FASTFIND. Resorting to the particular algorithm described by the code below
is essential to obtain the execution time stated in Theorem 2.2.7.

The algorithm FASTFIND is used to compute a fork. It is applied to state r 
and word y[j..k — 1] only when 


r · y[j..k − 1] is a factor of y.


In this case, from state r there is a path whose label is prefixed by y[j..k — 1]. 
Moreover, as the automaton is deterministic, the shortest of these paths is 
unique. The algorithm uses this property to determine edges of the path by 
only checking the first letter of their label. The code below, or at least its main 
part, implements the recurrence relation given in the proof of Lemma 2.2.5. 

The algorithm FASTFIND is used more precisely for computing the value
δ(r, y[j..k − 1]) (or that of δ(r, v) with the notations of the lemma). When
the end of the traversed path is not the sought state, a state p is created and
inserted between the last two states met.


FASTFIND(r, j, k)
 1  ▷ Computation of δ(r, y[j..k − 1])
 2  if j ≥ k then
 3      return r
 4  else q ← TARGET(r, y[j])
 5      (j′, ℓ) ← label(r, q)
 6      if j + ℓ ≤ k then
 7          return FASTFIND(q, j + ℓ, k)
 8      else adj[r] ← adj[r] \ {((j′, ℓ), q)}
 9          p ← NEWSTATE()
10          adj[r] ← adj[r] ∪ {((j′, k − j), p)}
11          adj[p] ← adj[p] ∪ {((j′ + k − j, ℓ − k + j), q)}
12          return p


The work of algorithms SLOWFINDC and FASTFIND is illustrated by Figures 
2.8 and 2.9. 
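
The three procedures above can be transcribed almost line for line. The following Python sketch is only an illustration under our own assumptions: the names suffix_tree, slow_find, fast_find, up, link and spelled_suffixes are ours, states are integers with 0 as the initial state, adj[p] is a dictionary indexed by the first letter of each label, and the recursion of FASTFIND is unrolled into a loop.

```python
def suffix_tree(y):
    """Build the compact suffix tree of y; return (adj, output)."""
    n = len(y)
    adj = [{}]       # adj[p][c] = (j, l, q): edge p -> q labeled y[j:j+l], c = y[j]
    up = [None]      # up[q] = (parent of q, j, l) for the edge entering q
    link = {0: 0}    # suffix links defined so far (the root links to itself)
    output = {}      # output[q] = starting position of the suffix ending at q

    def new_state():
        adj.append({})
        up.append(None)
        return len(adj) - 1

    def set_edge(p, j, l, q):
        # Replaces any edge of p whose label starts with the same letter.
        adj[p][y[j]] = (j, l, q)
        up[q] = (p, j, l)

    def slow_find(p, k):
        # Letter-by-letter descent from p along y[k:]; cuts an edge on mismatch.
        while k < n and y[k] in adj[p]:
            j, l, q = adj[p][y[k]]
            i = j
            while True:                      # do ... while
                i += 1
                k += 1
                if not (i < j + l and k < n and y[i] == y[k]):
                    break
            if i < j + l:                    # stopped inside the edge (p, q)
                r = new_state()
                set_edge(p, j, i - j, r)
                set_edge(r, i, l - i + j, q)
                return r, k
            p = q
        return p, k

    def fast_find(r, j, k):
        # Descent from r along y[j:k], assumed present in the tree: only the
        # first letter of each edge label is inspected.
        while j < k:
            jp, l, q = adj[r][y[j]]
            if j + l <= k:
                r, j = q, j + l
            else:                            # target falls inside edge (r, q)
                p = new_state()
                set_edge(r, jp, k - j, p)
                set_edge(p, jp + k - j, l - k + j, q)
                return p
        return r

    fork, k = 0, 0
    for i in range(n):
        k = max(i, k)
        if fork not in link:
            t, _, l = up[fork]
            if t == 0:
                l -= 1
            link[fork] = fast_find(link[t], k - l, k)
        fork, k = slow_find(link[fork], k)
        if k < n:
            q = new_state()
            set_edge(fork, k, n - k, q)
        else:
            q = fork
        output[q] = i
    output[0] = n
    return adj, output


def spelled_suffixes(y):
    """Map each output position to the word spelled from the root."""
    adj, output = suffix_tree(y)
    words, stack = {}, [(0, "")]
    while stack:
        p, w = stack.pop()
        if p in output:
            words[output[p]] = w
        for j, l, q in adj[p].values():
            stack.append((q, w + y[j:j + l]))
    return words
```

A simple sanity check is that spelled_suffixes(y) associates with every position i the word y[i:], for instance on the word abababbb of Figures 2.8 and 2.9.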


2.2.2. Complexity 


The lemma which follows is used for the evaluation of the execution time of
FASTFIND(r, j, k). It is an element of the proof of Theorem 2.2.7. It indicates
that the computing time is proportional (up to a multiplicative coefficient that
comes from the computing time of transitions) to the number of nodes of the
traversed path, and not to the length of the label of the path, which is the result
one would obtain immediately by applying algorithm SLOWFIND (Section 2.1).

For a state r of G(y) and a word v for which r · v is a factor of y, we denote
by end(r, v) the final vertex of the shortest path having origin r and whose label
has v as a prefix. Observe that end(r, v) = δ(r, v) only if v is the label of the
path.


(b) Suffix ababbb is added.


Figure 2.8. During the construction of G(abababbb), insertion of suffixes 
ababbb and babbb. (a) Automaton obtained after the insertion of suffixes 
abababbb and bababbb. The current fork is the initial state 0. (b) Suffix 
ababbb is added using letter-by-letter comparisons (slow find) and starting 
from state 0. This results in creating fork 3. The suffix link of 3 is not 
yet defined. 


LEMMA 2.2.5. Let r be a node of G(y) and let v be a word such that r · v
is a factor of y. Let (r, r_1, ..., r_ℓ) be the path having origin r and end r_ℓ =
end(r, v) in G(y). The computation of end(r, v) can be carried out in time
O(ℓ × log Card A) in the comparison model.


Proof. It is noticed that the path (r, r_1, ..., r_ℓ) exists by the condition “r · v is
a factor of y” and is unique because the tree is a deterministic automaton. If
v = ε one has end(r, v) = r. If not, let r_1 = TARGET(r, v[0]) and let v′ be the
label of edge (r, r_1). Note that

end(r, v) = r_1                   if |v| ≤ |v′| (i.e., v is a prefix of v′),
end(r, v) = end(r_1, v′^{-1}v)    otherwise.

This relation shows that each stage of the computation takes time α + β where
α is a constant and β is the computing time of TARGET(r, v[0]). This gives time
O(log Card A) in the comparison model.

The computation of r_ℓ, which includes traversing the path (r, r_1, ..., r_ℓ), thus
takes time O(ℓ × log Card A) as announced. □


COROLLARY 2.2.6. Let r be a node of G(y) and j, k two positions on y, j < k,
such that r · y[j..k − 1] is a factor of y. Let ℓ be the number of states of the tree
traversed during the computation of FASTFIND(r, j, k). Then, the execution
time of FASTFIND(r, j, k) is O(ℓ × log Card A) in the comparison model.


(c) Definition of the suffix link of state 3.


(d) Insertion of babbb.


Figure 2.9. During the construction of G(abababbb) (continued). (c) The
first step of the insertion of suffix babbb starts with the definition of the
suffix link of state 3, which is state 5. This is a fast find process from state
0 by word bab. (d) The second step of the insertion of babbb leads to the
creation of state 6. State 5, which is the fork of suffix babbb, becomes the
current fork to continue the construction.


Proof. Let v = y[j..k − 1] and let (r, r_1, ..., r_ℓ) be the path ending at end(r, v).
The computation of end(r, v) is done by FASTFIND, which implements the recur-
rence relation of the proof of Lemma 2.2.5. It thus takes time O(ℓ × log Card A).
During the last recursive call, a state p may be created and related edges mod-
ified. This operation takes time O(log Card A). This gives the total time
O(ℓ × log Card A) of the statement. □


THEOREM 2.2.7. The computation of SUFFIXTREE(y) = G(y) takes O(|y| ×
log Card A) time in the comparison model.

Proof. The fact that SUFFIXTREE(y) = G(y) is based mainly on Lemma 2.2.4 by
checking that the algorithm uses again the elementary technique of Section 2.1.

The evaluation of the running time rests on the following observations (see 
Figure 2.7): 


• Each stage of the computation done by FASTFIND, except perhaps the
  last stage, leads to traversing a state and strictly increases the value of
  k − ℓ (j on the figure), which never decreases.


• Each stage of the computation done by SLOWFIND, except perhaps the
  last stage, strictly increases the value of k, which never decreases.


• Each other instruction of the for loop leads to incrementing variable i,
  which never decreases.


The number of stages done by FASTFIND is thus bounded by |y|, which gives
O(|y| × log Card A) time for these stages according to Corollary 2.2.6. The same
reasoning applies to the number of stages carried out by SLOWFIND, and also
for the other stages, still giving time O(|y| × log Card A).

Therefore, one obtains a total execution time O(|y| × log Card A). □


2.3. Contexts of factors 


We present in this section the formal basis for the construction of the minimal
automaton that accepts the suffixes of a word, called the suffix automaton
of the word. Some properties contribute to the proof of the construction of the
automaton (Theorems 2.3.10 and 2.4.7 below).

The suffix automaton is denoted by A(y). Its states are classes of the (right)
syntactic equivalence associated with Suff(y), i.e., are the sets of factors of y
having the same right context within y. These states are in one-to-one
correspondence with the (right) contexts of the factors of y in y itself. Let us recall
that the (right) context of a word u is R_y(u) = u^{-1}Suff(y). We denote by ≡_y
the syntactic congruence which is defined, for u, v ∈ A*, by


u ≡_y v


if and only if
R_y(u) = R_y(v).


One can also identify the states of A(y) with sets of indices on y which are end
positions of occurrences of equivalent factors.

The right contexts satisfy some properties stated below that are used later
in the chapter. The first remark concerns the link between the relation “is a
suffix of” and the inclusion of contexts. For any factor u of y, one denotes by


end-pos(u) = min{|wu| : wu is a prefix of y} − 1


the right position of the first occurrence of u in y. Note that end-pos(ε) = −1.

LEMMA 2.3.1. Let u, v ∈ Fact(y) with |u| ≤ |v|. Then,
u is a suffix of v implies R_y(v) ⊆ R_y(u)
and
R_y(u) = R_y(v) implies both end-pos(u) = end-pos(v) and u is a suffix of v.


Proof. Let us suppose that u is a suffix of v. Let z ∈ R_y(v). By definition, vz is
a suffix of y and, since u is a suffix of v, the word uz is also a suffix of y. Thus,
z ∈ R_y(u), which proves the first implication.

Let us now suppose R_y(u) = R_y(v). Let w, z be such that y = w · z with
|w| = end-pos(u) + 1. By definition of end-pos, u is a suffix of w. Therefore, z is
the longest word in R_y(u). The assumption implies that z is also the longest
word in R_y(v), which yields |w| = end-pos(v) + 1. The words u and v are thus
both suffixes of w, and as u is not longer than v one obtains that u is a suffix of v.
This finishes the proof of the second implication and the whole proof. □
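
The definitions of right context and end-pos, as well as the two implications of Lemma 2.3.1, can be checked by brute force on small words. The Python sketch below is only an illustration; the function names (factors, right_context, end_pos, check_lemma_2_3_1) are ours and the computation is quadratic or worse, which is acceptable for experiments only.

```python
def factors(y):
    """Fact(y): all factors of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def right_context(y, u):
    """R_y(u): the words z such that uz is a suffix of y."""
    return frozenset(y[i + len(u):]
                     for i in range(len(y) - len(u) + 1)
                     if y.startswith(u, i))


def end_pos(y, u):
    """Right position of the first occurrence of the factor u in y."""
    i = y.find(u)                 # start of the first occurrence, -1 if absent
    if i < 0:
        raise ValueError("u is not a factor of y")
    return i + len(u) - 1


def check_lemma_2_3_1(y):
    """Test both implications of the lemma on all pairs of factors of y."""
    for u in factors(y):
        for v in factors(y):
            if len(u) > len(v):
                continue
            ru, rv = right_context(y, u), right_context(y, v)
            if v.endswith(u) and not rv <= ru:
                return False
            if ru == rv and not (end_pos(y, u) == end_pos(y, v)
                                 and v.endswith(u)):
                return False
    return True
```

For y = aabbabb, the factors bb and abb have the same right context {abb, ε} and the same end position 3.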


Another very useful property of the congruence is that it partitions the 
suffixes of a factor of y into intervals according to their length. 


LEMMA 2.3.2. Let u, v, w ∈ Fact(y). If u is a suffix of v, v is a suffix of w and
u ≡_y w, then u ≡_y v ≡_y w.


Proof. By Lemma 2.3.1, the assumption implies
R_y(w) ⊆ R_y(v) ⊆ R_y(u).
Then, the equivalence u ≡_y w, which means R_y(u) = R_y(w), gives the conclusion.
□


A consequence of the following property is that inclusion induces a tree
structure on the right contexts. In this tree, the parent link is related to the
proper inclusion of sets. This link, important for the fast construction of the
automaton, corresponds to the suffix function defined next.


COROLLARY 2.3.3. Let u, v ∈ A*. Then, the contexts of u and v are compara-
ble for inclusion or are disjoint, i.e., at least one of the three following conditions
is satisfied:
1. R_y(u) ⊆ R_y(v),
2. R_y(v) ⊆ R_y(u),
3. R_y(u) ∩ R_y(v) = ∅.

Proof. One proves the property by showing that the condition

R_y(u) ∩ R_y(v) ≠ ∅

implies

R_y(u) ⊆ R_y(v) or R_y(v) ⊆ R_y(u).

Let z ∈ R_y(u) ∩ R_y(v). Then, uz, vz are suffixes of y, and u, v are suffixes
of yz^{-1}. Consequently, among u and v one is a suffix of the other. One obtains
finally the conclusion by Lemma 2.3.1. □
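
The tree structure on right contexts can be observed experimentally. The following Python sketch (with function names of our own choosing) computes the right contexts of all factors of a word by brute force and tests that any two of them are nested or disjoint; words that are not factors have an empty context and are trivially covered.

```python
from itertools import combinations


def factors(y):
    """Fact(y): all factors of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def right_context(y, u):
    """R_y(u): the words z such that uz is a suffix of y."""
    return frozenset(y[i + len(u):]
                     for i in range(len(y) - len(u) + 1)
                     if y.startswith(u, i))


def nested_or_disjoint(y):
    """True if any two right contexts are comparable or disjoint."""
    contexts = {right_context(y, u) for u in factors(y)}
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(contexts, 2))
```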
2.3.1. Suffix function


On the set Fact(y) we consider the function s_y called the suffix function of y.
It is defined, for all v ∈ Fact(y) \ {ε}, by


s_y(v) = longest suffix u of v such that u ≢_y v.
After Lemma 2.3.1, one deduces the equivalent definition:
s_y(v) = longest suffix u of v such that R_y(v) ⊂ R_y(u).


Note that, by definition, s_y(v) is a proper suffix of v (i.e., |s_y(v)| < |v|). The
following lemma shows that the suffix function s_y induces a failure function on
states of A(y).


LEMMA 2.3.4. Let u, v ∈ Fact(y) \ {ε}. If u ≡_y v, then s_y(u) = s_y(v).


Proof. By Lemma 2.3.1 one can suppose without loss of generality that u is a
suffix of v. The word u cannot be a suffix of s_y(v) because Lemma 2.3.2 would
imply s_y(v) ≡_y v, which contradicts the definition of s_y(v). Consequently, s_y(v)
is a suffix of u. Since, by definition, s_y(v) is the longest suffix of v which is not
equivalent to v, it is also s_y(u). Thus, s_y(u) = s_y(v). □


LEMMA 2.3.5. Let y ∈ A⁺. The word s_y(y) is the longest suffix of y that
appears at least twice in y itself.


Proof. The context R_y(y) is {ε}. As y and s_y(y) are not equivalent, R_y(s_y(y))
contains some nonempty word z. Then, s_y(y)z and s_y(y) are suffixes of y,
which shows that s_y(y) appears at least twice in y.

Any suffix w of y, longer than s_y(y), is equivalent to y by definition of s_y(y).
It thus satisfies R_y(w) = R_y(y) = {ε}. This shows that w appears only once
in y and finishes the proof. □
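
The suffix function lends itself to a direct, if inefficient, computation from its definition. The Python sketch below uses names of our own (right_context, suffix_function, longest_repeated_suffix) and is meant only to illustrate Lemmas 2.3.4 and 2.3.5 on small examples.

```python
def right_context(y, u):
    """R_y(u): the words z such that uz is a suffix of y."""
    return frozenset(y[i + len(u):]
                     for i in range(len(y) - len(u) + 1)
                     if y.startswith(u, i))


def suffix_function(y, v):
    """s_y(v): the longest suffix of v that is not equivalent to v."""
    if v == "" or v not in y:
        raise ValueError("v must be a nonempty factor of y")
    rv = right_context(y, v)
    for i in range(1, len(v) + 1):        # proper suffixes, longest first
        if right_context(y, v[i:]) != rv:
            return v[i:]


def longest_repeated_suffix(y):
    """Longest proper suffix of y occurring at least twice in y."""
    for i in range(1, len(y) + 1):
        u = y[i:]
        occurrences = sum(y.startswith(u, k)
                          for k in range(len(y) - len(u) + 1))
        if occurrences >= 2:
            return u
```

With y = aabbabb, one gets s_y(y) = abb, which is also the longest suffix of y occurring twice, and the equivalent factors bb and abb have the same image b.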


The following lemma shows that the image of a factor of y by the suffix 
function is a word of maximum length in its equivalence class. 


LEMMA 2.3.6. Let u ∈ Fact(y) \ {ε}. Then, any word equivalent to s_y(u) is a
suffix of it.


Proof. Let w = s_y(u) and v ≡_y w. We show that v is a suffix of w. The word w
is a proper suffix of u. If the conclusion of the statement is false, according to
Lemma 2.3.1 one obtains that w is a proper suffix of v. Let then z ∈ R_y(u). As
w is a suffix of u equivalent to v, we have z ∈ R_y(w) = R_y(v). Then, u and v
are both suffixes of yz^{-1}, which implies that one is a suffix of the other. But this
contradicts either the definition of w = s_y(u) or the conclusion of Lemma 2.3.2,
and proves that v is a suffix of w = s_y(u). □


The preceding property is considered in Section 2.8 where the automaton is
used as a pattern searching engine. One can check that the property of s_y is not
satisfied in general on the minimal automaton which accepts the factors (and not
only suffixes) of a word, or, more exactly, is not satisfied on the similar function
defined from the right congruence defined from Fact(y) (instead of Suff(y)).


2.3.2. Evolution of the congruence 


The on-line construction of suffix automata relies on the relationship between
≡_wa and ≡_w which we examine here. By doing this, we consider that the generic
word y is equal to wa for some letter a. The properties detailed below are also
used to derive precise bounds on the size of the automaton in the following
section.

The first relation states that ≡_wa is a refinement of ≡_w.


LEMMA 2.3.7. Let w ∈ A* and a ∈ A. The congruence ≡_wa is a refinement of
≡_w, i.e., for all words u, v ∈ A*, u ≡_wa v implies u ≡_w v.


Proof. Let us assume that u ≡_wa v, that is, R_wa(u) = R_wa(v), and show that
u ≡_w v, that is, R_w(u) = R_w(v). We show R_w(u) ⊆ R_w(v) only because the
opposite inclusion results by symmetry.

If R_w(u) = ∅ the inclusion is clear. If not, let z ∈ R_w(u). Then uz is a suffix
of w, which implies that uza is a suffix of wa. The assumption gives that vza
is a suffix of wa, and thus vz is a suffix of w, or z ∈ R_w(v), which finishes the
proof. □


The congruence ≡_w partitions A* into classes. Lemma 2.3.7 amounts to saying
that these classes are unions of classes according to ≡_wa (a ∈ A). It turns out that
only one or two classes with respect to ≡_w are divided into two subclasses to
give the partition induced by ≡_wa. One of these two classes consists of words
not appearing in w. It contains the word wa itself, which produces a new class
and a new state of the suffix automaton (see Lemma 2.3.8). Theorem 2.3.10 and
its corollaries give conditions for the division of another class and indicate how
this is done.


LEMMA 2.3.8. Let w ∈ A* and a ∈ A. Let z be the longest suffix of wa that
appears in w. If u is a suffix of wa strictly longer than z, then the equivalence
u ≡_wa wa holds.


Proof. It is a direct consequence of Lemma 2.3.5 because z occurs at least twice
in wa. □


Before going to the main theorem we state an additional relation concerning 
right contexts. 


LEMMA 2.3.9. Let w ∈ A* and a ∈ A. Then, for each word u ∈ A*,


R_wa(u) = {ε} ∪ R_w(u)a    if u is a suffix of wa,
R_wa(u) = R_w(u)a          otherwise.


Proof. First notice that ε ∈ R_wa(u) is equivalent to: u is a suffix of wa. It is
thus enough to show R_wa(u) \ {ε} = R_w(u)a.

Let z be a nonempty word of R_wa(u). We get that uz is a suffix of wa. The
word uz can be written uz′a with uz′ a suffix of w. Consequently, z′ ∈ R_w(u),
and thus z ∈ R_w(u)a.

Conversely, let z be a (nonempty) word in R_w(u)a. It can be written z′a for
z′ ∈ R_w(u). Thus, uz′ is a suffix of w, which implies that uz = uz′a is a suffix
of wa, that is, z ∈ R_wa(u). This proves the converse statement and ends the
proof. □
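
The relation of Lemma 2.3.9 is easy to test mechanically. In the Python sketch below, whose function names are ours, the right contexts with respect to w and to wa are computed by brute force and compared through the formula of the lemma, for every factor u of wa (the other words have empty contexts on both sides).

```python
def factors(y):
    """Fact(y): all factors of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def right_context(y, u):
    """R_y(u): the words z such that uz is a suffix of y."""
    return frozenset(y[i + len(u):]
                     for i in range(len(y) - len(u) + 1)
                     if y.startswith(u, i))


def check_lemma_2_3_9(w, a):
    """Compare R_wa(u) with the expression given by the lemma."""
    wa = w + a
    for u in factors(wa):
        expected = {z + a for z in right_context(w, u)}   # R_w(u)a
        if wa.endswith(u):
            expected.add("")                              # the empty word
        if right_context(wa, u) != expected:
            return False
    return True
```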


THEOREM 2.3.10. Let w ∈ A* and a ∈ A. Let z be the longest suffix of wa
that appears in w. Let z′ be the longest factor of w for which z′ ≡_w z. Then,
for each u, v ∈ Fact(w),


u ≡_w v and u ≢_w z imply u ≡_wa v.


Moreover, for each word u such that u ≡_w z,


u ≡_wa z     if |u| ≤ |z|,
u ≡_wa z′    otherwise.

Proof. Let u, v ∈ Fact(w) be such that u ≡_w v. By definition of the equivalence
we get R_w(u) = R_w(v). We suppose first that u ≢_w z and show that R_wa(u) =
R_wa(v), which is equivalent to u ≡_wa v.

According to Lemma 2.3.9, we have just to show that u is a suffix of wa if
and only if v is a suffix of wa. Indeed, it is enough to show that if u is a suffix of
wa then v is a suffix of wa since the opposite implication results by symmetry.

So, let us suppose that u is a suffix of wa. We deduce from the fact that u is
a factor of w and the definition of z that u is a suffix of z. We can thus consider
the greatest integer j ≥ 0 for which |u| ≤ |s_w^j(z)|. Let us note that s_w^j(z) is a
suffix of wa (like z is), and that Lemma 2.3.2 ensures that u ≡_w s_w^j(z). From
which we get v ≡_w s_w^j(z) by transitivity.

Since u ≢_w z, we have j > 0. Lemma 2.3.6 implies that v is a suffix of s_w^j(z),
and thus also of wa as wished. This proves the first part of the statement.

Let us consider now a word u such that u ≡_w z.

When |u| ≤ |z|, to show u ≡_wa z by using the above argument, we have only
to check that u is a suffix of wa because z is a suffix of wa. This, in fact, is a
simple consequence of Lemma 2.3.1.

Let us suppose |u| > |z|. The existence of such a word u implies z′ ≠ z and
|z′| > |z| (z is a proper suffix of z′). Consequently, by the definition of z, u and
z′ are not suffixes of wa. Using the above argument again, this proves u ≡_wa z′
and finishes the proof. □
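
To see the theorem at work, one can compare the two congruences by brute force. The Python sketch below (function and variable names are ours) computes z and z′ from their definitions and tests the two parts of the statement on the factors of w.

```python
def factors(y):
    """Fact(y): all factors of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def right_context(y, u):
    """R_y(u): the words z such that uz is a suffix of y."""
    return frozenset(y[i + len(u):]
                     for i in range(len(y) - len(u) + 1)
                     if y.startswith(u, i))


def check_theorem_2_3_10(w, a):
    wa = w + a
    fact = factors(w)
    r_w = {u: right_context(w, u) for u in fact}
    r_wa = {u: right_context(wa, u) for u in fact}
    # z: longest suffix of wa that appears in w (suffixes tried longest first)
    z = next(wa[i:] for i in range(len(wa) + 1) if wa[i:] in w)
    z_class = [u for u in fact if r_w[u] == r_w[z]]
    z_prime = max(z_class, key=len)       # longest word equivalent to z
    # First part: classes other than that of z are left unchanged.
    first = all(r_wa[u] == r_wa[v]
                for u in fact for v in fact
                if r_w[u] == r_w[v] and r_w[u] != r_w[z])
    # Second part: the class of z is cut according to the length of z.
    second = all(r_wa[u] == r_wa[z if len(u) <= len(z) else z_prime]
                 for u in z_class)
    return first and second
```

For w = ccccbbccc and a = b, one finds z = cccb and z′ = ccccb, so the class {ccccb, cccb, ccb, cb} is split into {ccccb} and {cccb, ccb, cb}.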


The two corollaries of the preceding theorem stated below refer to situations
that are simple to manage during the construction of suffix automata.

COROLLARY 2.3.11. Let w ∈ A* and a ∈ A. Let z be the longest suffix of wa
that appears in w. Let z′ be the longest word such that z′ ≡_w z. Let us suppose
z′ = z. Then, for each u, v ∈ Fact(w),


u ≡_w v implies u ≡_wa v.


Proof. Let u, v ∈ Fact(w) be such that u ≡_w v. We prove the equivalence
u ≡_wa v. The conclusion comes directly from Theorem 2.3.10 if u ≢_w z. Else,
u ≡_w z; by the assumption made on z and Lemma 2.3.1, we get |u| ≤ |z|.
Finally, Theorem 2.3.10 gives the same conclusion as above. □


COROLLARY 2.3.12. Let w ∈ A* and a ∈ A. If the letter a does not appear
in w, then, for each u, v ∈ Fact(w),


u ≡_w v implies u ≡_wa v.


Proof. Since a does not appear in w, the word z of Corollary 2.3.11 is the empty
word. It is of course the longest of its class, which makes it possible to apply
Corollary 2.3.11 and gives the same conclusion. □


2.4. Suffix automaton 


The suffix automaton of a word y is the minimal automaton that accepts the set
of suffixes of y. It is denoted by A(y). The structure is intended to be used as
an index on the word (see Section 2.6) but also constitutes a device to search for
factors of y within another text (see Section 2.8). The most surprising property
of the automaton is that its size is linear in the length of y although the number
of factors of y can be quadratic. The construction of the automaton also takes
linear time on a fixed alphabet. Figure 2.10 shows an example of such an
automaton to be compared with the trees in Figures 2.1 and 2.5.

As we do not force the automaton to be complete, the class of words which
do not appear in y, whose right context is empty, is not a state of A(y).


2.4.1. Size of suffix automata 


The size of an automaton is expressed both by the number of its states and the
number of its edges. We show that A(y) has less than 2|y| states and less than
3|y| edges, for a total size O(|y|). This result is based on Theorem 2.3.10 of
the preceding section. Figure 2.11 shows an automaton that has the maximum
number of states for a word of length 7.


PROPOSITION 2.4.1. Let y ∈ A* be a word of length n and let st(y) be the
number of states of A(y). For n = 0, we have st(y) = 1; for n = 1, we have
st(y) = 2; for n > 1 finally, we have


n + 1 ≤ st(y) ≤ 2n − 1,


and the upper bound is reached if and only if y is of the form ab^{n−1}, for two
distinct letters a, b.


Figure 2.10. The (minimal) suffix automaton of ababbb.


Proof. The equalities concerning short words can be checked directly including
st(y) = 3 when |y| = 2. Let us suppose n > 2. The minimal number of states of
A(y) is obviously n + 1 (otherwise the path labeled by y would contain a cycle
yielding an infinite number of words recognized by the automaton), a minimum
which is reached with y = a^n (a ∈ A).

Let us show the upper bound. By Theorem 2.3.10, each letter y[i], 2 ≤ i ≤
n − 1, increases by at most two the number of states of A(y[0..i − 1]). As the
number of states of A(y[0]y[1]) is 3, it follows that


st(y) ≤ 3 + 2(n − 2)
      = 2n − 1,


as announced.

The construction of a word of length n whose suffix automaton has 2n − 1
states is still a simple application of Theorem 2.3.10 by noting that each
letter y[2], y[3], ..., y[n − 1] must effectively lead to the creation of two states
during the construction. Notice that after the choice of the first two letters, which
must be different, there is no choice for the other letters. This produces the
only possible form given in the statement. □
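
Since the states of A(y) are in one-to-one correspondence with the nonempty right contexts of the factors of y, the bounds of Proposition 2.4.1 can be observed by counting these contexts. The Python sketch below does so by brute force; the name state_count is ours and the method is of course far from the linear-time construction described later.

```python
def state_count(y):
    """st(y): number of distinct right contexts of the factors of y."""
    n = len(y)
    contexts = set()
    for i in range(n + 1):
        for j in range(i, n + 1):
            u = y[i:j]
            contexts.add(frozenset(y[k + len(u):]
                                   for k in range(n - len(u) + 1)
                                   if y.startswith(u, k)))
    return len(contexts)
```

For words of length 7, state_count gives 8 on aaaaaaa, 13 on abbbbbb and 11 on aabbabb.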


LEMMA 2.4.2. Let y ∈ A⁺ and let ed(y) be the number of edges of A(y).
Then,
ed(y) ≤ st(y) + |y| − 2.


Figure 2.11. A suffix automaton with the maximum number of states.

Figure 2.12. A suffix automaton with the maximum number of edges.


Proof. Let us call q_0 the initial state of A(y), and consider the spanning tree of
longest paths starting at q_0 in A(y). The tree contains st(y) − 1 edges of A(y)
because exactly one edge of the tree arrives at each state except the initial state.
With each other edge (p, a, q) we associate the suffix uav of y defined as
follows: u is the label of the path starting at q_0 and ending at p; v is the label
of the longest path from q arriving on a terminal state. Doing so, we get an
injection from the set of concerned edges to the set Suff(y). The suffixes y and
ε are not concerned because they are labels of paths in the spanning tree. This
shows that there are at most Card(Suff(y) \ {y, ε}) = |y| − 1 additional edges.
Summing up the numbers of edges of the two types, we get a maximum of
st(y) + |y| − 2 edges in A(y). □


Figure 2.12 shows an automaton that has the maximum number of edges for
a word of length 7.


PROPOSITION 2.4.3. Let y ∈ A* be a word of length n and let ed(y) be the
number of edges of A(y). For n = 0, we have ed(y) = 0; for n = 1, we have
ed(y) = 1; for n = 2, we have ed(y) = 2 or ed(y) = 3; finally, for n > 2, we have


n ≤ ed(y) ≤ 3n − 4,


and the upper bound is reached if y is of the form ab^{n−2}c, where a, b and c are
three pairwise distinct letters.


Proof. We can directly check the results on short words. Let us consider n > 2.
The lower bound is immediate and is reached by the word y = a^n (a ∈ A).

Let us examine then the upper bound. By Proposition 2.4.1 and Lemma 2.4.2
we obtain


ed(y) ≤ (2n − 1) + n − 2
      = 3n − 3.


The quantity 2n − 1 is the maximum number of states, obtained only if y = ab^{n−1}
(a, b ∈ A, a ≠ b). But for a word in this form the number of edges is only 2n − 1.
Thus, ed(y) ≤ 3n − 4.

It can be checked that the automaton A(ab^{n−2}c) (where a, b, c ∈ A with
Card{a, b, c} = 3) has 2n − 2 states and 3n − 4 edges. □


Figure 2.13. The suffix automaton A(aabbabb). Suffix links on states
are: f[1] = 0, f[2] = 1, f[3] = 3″, f[3″] = 3′, f[3′] = 0, f[4] = 4″,
f[4″] = 3′, f[5] = 1, f[6] = 3″, f[7] = 4″. The suffix path of 7 is
(7, 4″, 3′, 0), which includes all the terminal states of the automaton (see
Corollary 2.4.6).
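
The bounds on the number of edges can be observed in the same brute-force manner as those on the number of states. In the Python sketch below (names are ours), an edge is identified by the right context of a factor u together with a letter a such that ua is still a factor of y; this is legitimate because the equivalence is a right congruence.

```python
def right_context(y, u):
    """R_y(u): the words z such that uz is a suffix of y."""
    return frozenset(y[i + len(u):]
                     for i in range(len(y) - len(u) + 1)
                     if y.startswith(u, i))


def edge_count(y):
    """ed(y): number of pairs (class of u, letter a) with ua a factor of y."""
    n = len(y)
    edges = set()
    for i in range(n + 1):
        for j in range(i, n + 1):
            u = y[i:j]
            for a in set(y):
                if u + a in y:
                    edges.add((right_context(y, u), a))
    return len(edges)
```

On the word abbbbbc of length 7 this count is 17 = 3 × 7 − 4, and on aabbabb it is 13.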


The following statement summarizes Propositions 2.4.1 and 2.4.3. 


THEOREM 2.4.4. The total size of the suffix automaton of a word is linear in
the length of the word.


2.4.2. Suffix links and suffix paths 


Theorem 2.3.10 and its two consecutive corollaries provide the frame of the
on-line construction of the suffix automaton A(y). The algorithm controls the
conditions which appear in these statements by means of a function defined on
the states of the automaton, the suffix link function, and of a classification of
the edges in solid and non-solid edges. We define these two concepts hereafter.

Let p be a state of A(y), different from the initial state. State p is a class of
factors of y that are equivalent with respect to equivalence ≡_y. Let u be any
word in the class (u ≠ ε because p is not the initial state). We define the suffix
link of p, denoted by f_y(p), as the congruence class of s_y(u). The function f_y is
called the suffix link function of the automaton. According to Lemma 2.3.4 the
value of s_y(u) is independent of the word u chosen in the class of p, which makes
the definition coherent. The suffix link function is also called a failure function
and used with this meaning in Section 2.8. An example is given in Figure 2.13.


For a state p of A(y), we denote by lg_y(p) the maximum length of words u
in the congruence class of p. It is also the length of the longest path starting
from the initial state and ending at p. The longest paths starting at the initial
state form a spanning tree for A(y) (consequence of Lemma 2.3.1). Edges which
belong to this tree are qualified as solid. In an equivalent way,


edge (p, a, q) is solid
if and only if
lg_y(q) = lg_y(p) + 1.


This notion is used in the construction of the automaton.
Suffix links induce by iteration what we call suffix paths in A(y) (see Fig-
ure 2.13). One can note that


q = f_y(p) implies lg_y(q) < lg_y(p).


So, the sequence
(p, f_y(p), f_y^2(p), ...)


is finite and ends at the initial state (which does not have a suffix link). It is
called the suffix path of p in A(y), and is denoted by SP(p).

Let last be the state of A(y) that is the class of word y itself. This state is
characterized by the fact that it is not the origin of any edge. The suffix path
of last,

(last, f_y(last), f_y^2(last), ..., f_y^h(last) = q_0),


where q_0 is the initial state of the automaton, plays an important part in the
on-line construction. It is used to effectively test conditions of Theorem 2.3.10
and its corollaries. We denote by δ the transition function of A(y).


PROPOSITION 2.4.5. Let u ∈ Fact(y) \ {ε} and let p = δ(q_0, u). Then, for each
integer j ≥ 0 for which s_y^j(u) is defined,


f_y^j(p) = δ(q_0, s_y^j(u)).


Proof. We prove the result by induction on j. If j = 0, f_y^j(p) = p and
s_y^j(u) = u, therefore the equality is satisfied by assumption.

Let j > 0 such that s_y^j(u) is defined and suppose by induction hypothesis
that f_y^{j−1}(p) = δ(q_0, s_y^{j−1}(u)). By definition of f_y, f_y(f_y^{j−1}(p)) is the congru-
ence class of the word s_y(s_y^{j−1}(u)). Consequently, f_y^j(p) = δ(q_0, s_y^j(u)), which
completes the induction and the proof. □


COROLLARY 2.4.6. The terminal states of A(y) are the states of the suffix path
of last, SP(last).


Proof. First, we prove that states of the suffix path are terminal. Let p be any
state of SP(last). One has p = f_y^j(last) for some j ≥ 0. Since last = δ(q_0, y),
Proposition 2.4.5 implies p = δ(q_0, s_y^j(y)). And as s_y^j(y) is a suffix of y, p is a
terminal state.

Conversely, let p be a terminal state of A(y). Let then u be a suffix of y such
that p = δ(q_0, u). Since u is a suffix of y, we can consider the greatest integer
j ≥ 0 for which |u| ≤ |s_y^j(y)|. By Lemma 2.3.2 one obtains u ≡_y s_y^j(y). Thus,
p = δ(q_0, s_y^j(y)) by definition of A(y). Therefore, Proposition 2.4.5 applied to
y implies p = f_y^j(last), which proves that p appears in SP(last). This ends the
proof. □

2.4.3. On-line construction


It is possible to build the suffix automaton of y by applying to the suffix trie of
Section 2.1 standard algorithms that minimize automata. But the suffix trie can
be of quadratic size, which makes this approach quadratic in time and space.
We present an on-line construction algorithm that avoids this problem and works
in linear space with an execution time O(|y| × log Card A).

The algorithm treats the prefixes of y from the shortest, ε, to the longest, y
itself. At each stage, just after having treated prefix w, the following information
is available:

• The suffix automaton A(w) with its transition function δ.

• The table f, defined on the states of A(w), which implements the suffix
  link function f_w.

• The table L, defined on the states of A(w), which implements the function
  length, lg_w.

• The state last.

Terminal states of A(w) are not explicitly marked; they are given implicitly
by the suffix path of last (Corollary 2.4.6). The implementation of A(w) with
these additional elements is discussed just before the analysis of complexity of
the computation.

Algorithm SUFFIXAUTOMATON that builds the suffix automaton of y relies
on the procedure EXTENSION given further. This procedure treats the next
letter of word y. It transforms the suffix automaton A(w) already built into
the suffix automaton A(wa) (wa is a prefix of y, a ∈ A). After all extensions,
terminal states are eventually marked explicitly (lines 7 to 10).


SUFFIXAUTOMATON(y, n)
 1  M ← NEWAUTOMATON()
 2  L[initial(M)] ← 0
 3  last[M] ← initial(M)
 4  for each letter a of y, sequentially do
 5      ▷ Extension of M by letter a
 6      EXTENSION(a)
 7  p ← last[M]
 8  do  terminal(p) ← TRUE
 9      p ← f[p]
10  while p is defined
11  return M


Contrary to what happens for the construction of suffix trees, a state-
splitting operation is necessary in some circumstances. It is realized by the
algorithm CLONE below.

Figure 2.14. Automaton A(ccccbbccc) on which is illustrated in Fig-
ures 2.15, 2.16, and 2.17 the procedure EXTENSION(a) according to three
cases.


EXTENSION(a)
 1  new ← NEWSTATE()
 2  L[new] ← L[last[M]] + 1
 3  p ← last[M]
 4  do  adj[p] ← adj[p] ∪ {(a, new)}
 5      p ← f[p]
 6  while p is defined and TARGET(p, a) is undefined
 7  if p is undefined then
 8      f[new] ← initial(M)
 9  else q ← TARGET(p, a)
10      if (p, a, q) is a solid edge, i.e., L[p] + 1 = L[q] then
11          f[new] ← q
12      else clone ← CLONE(p, a, q)
13          f[new] ← clone
14  last[M] ← new


CLONE(p, a, q)
 1  clone ← NEWSTATE()
 2  L[clone] ← L[p] + 1
 3  for each (b, q′) ∈ adj[q] do
 4      adj[clone] ← adj[clone] ∪ {(b, q′)}
 5  f[clone] ← f[q]
 6  f[q] ← clone
 7  do  adj[p] ← adj[p] \ {(a, q)}
 8      adj[p] ← adj[p] ∪ {(a, clone)}
 9      p ← f[p]
10  while p is defined and TARGET(p, a) = q
11  return clone
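
As an illustration, the three procedures SUFFIXAUTOMATON, EXTENSION and CLONE can be sketched in Python as follows. The class name SuffixAutomaton, the use of integers for states (0 being the initial state), of dictionaries for adjacency sets and of None for an undefined suffix link are our own choices, not those of the text; terminal states are read off the suffix path of last as in Corollary 2.4.6.

```python
class SuffixAutomaton:
    def __init__(self, y=""):
        self.adj = [{}]      # adj[p][a] = target of the edge labeled a from p
        self.L = [0]         # L[p] = length of the longest word in class p
        self.f = [None]      # suffix links; undefined on the initial state
        self.last = 0
        for a in y:
            self.extension(a)

    def _new_state(self, length):
        self.adj.append({})
        self.L.append(length)
        self.f.append(None)
        return len(self.adj) - 1

    def extension(self, a):
        new = self._new_state(self.L[self.last] + 1)
        p = self.last
        while True:                          # do ... while
            self.adj[p][a] = new
            p = self.f[p]
            if p is None or a in self.adj[p]:
                break
        if p is None:
            self.f[new] = 0
        else:
            q = self.adj[p][a]
            if self.L[p] + 1 == self.L[q]:   # (p, a, q) is a solid edge
                self.f[new] = q
            else:
                self.f[new] = self._clone(p, a, q)
        self.last = new

    def _clone(self, p, a, q):
        clone = self._new_state(self.L[p] + 1)
        self.adj[clone] = dict(self.adj[q])  # copy the outgoing edges of q
        self.f[clone] = self.f[q]
        self.f[q] = clone
        while True:                          # redirect edges onto the clone
            self.adj[p][a] = clone
            p = self.f[p]
            if p is None or self.adj[p].get(a) != q:
                break
        return clone

    def terminal_states(self):
        states, p = set(), self.last
        while p is not None:                 # suffix path of last
            states.add(p)
            p = self.f[p]
        return states

    def accepts(self, u):
        p = 0
        for a in u:
            if a not in self.adj[p]:
                return False
            p = self.adj[p][a]
        return p in self.terminal_states()
```

For y = aabbabb this sketch produces 11 states and 13 edges, and the states are numbered in order of creation, so the numbering differs from that of Figure 2.13.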


Figures 2.14, 2.15, 2.16, and 2.17 illustrate how the procedure EXTENSION
works.

Figure 2.15. Suffix automaton A(ccccbbcccd) obtained by extending
A(ccccbbccc) of Figure 2.14 by letter d. During the execution of the first
loop of EXTENSION(d), state p traverses the suffix path (9, 3, 2, 1, 0). At
the same time, edges labeled by letter d are created, starting from these
states and leading to 10, the last created state. The loop stops at the
initial state. This situation corresponds to Corollary 2.3.12.

Figure 2.16. Suffix automaton A(ccccbbcccc) obtained by extending
A(ccccbbccc) of Figure 2.14 by letter c. The first loop of the procedure
EXTENSION(c) stops at state 3 = f[9] because an edge labeled by c starts
from this state. Moreover, the edge (3, c, 4) is solid. We obtain directly
the suffix link of the new state created: f[10] = δ(3, c) = 4. There is
nothing else to do according to Corollary 2.3.11.


THEOREM 2.4.7. Algorithm SUFFIXAUTOMATON builds a suffix automaton, 
that is SUFFIXAUTOMATON(y) is the automaton A(y), for y € A*. 


Proof. We show by recurrence on |y| that the automaton is computed correctly, 
as well as tables Z and f and state last. It is shown then that terminal states 
are computed correctly. 

If |y| = 0, the algorithm builds an automaton consisting of only one state 
which is both an initial and terminal state. No transition is defined. The 
Figure 2.17. Suffix automaton A(ccccbbcccb) obtained by extending
A(ccccbbccc) of Figure 2.14 by letter b. The first loop of the procedure
EXTENSION(b) stops at state 3 = f[9] because an edge labeled by b starts
from this state. In the automaton A(ccccbbccc) edge (3, b, 5) is not solid.
The word cccb is a suffix of ccccbbcccb but ccccb is not, although they both
lead to state 5. This state is duplicated into the final state 5'' that is the
class of factors cccb, ccb and cb. Edges (3, b, 5), (2, b, 5) and (1, b, 5) of
A(ccccbbccc) are redirected onto 5'' according to Theorem 2.3.10.


automaton thus recognizes the language {ε} which is Suff(y). Elements f and
last as well as tables L and f are also correctly calculated.

We consider now that |y| > 0 and that y = wa, for a ∈ A and w ∈ A*.
We suppose, by induction, that the current automaton M is A(w) with its
transition function δ_w, that q0 = initial(M), that last = δ_w(q0, w), that the
table L satisfies L[p] = lg_w(p) for any state p, and that the table f satisfies
f[p] = f_w(p) for any state p different from the initial state.

We first show that the procedure EXTENSION carries out correctly the
transformation of the automaton M, of the variable last, and of the tables L
and f.

The variable p of procedure EXTENSION runs through the states of the suffix
path SP(last) of A(w). The first loop creates transitions labeled by a targeted
at the new state new in agreement with Lemma 2.3.8. We have also the equality
L[new] = lg_y(new).

When the first loop stops, three disjoint cases arise: 

1. p is not defined, 

2. (p,a,q) is a solid edge, 

3. (p,a,q) is a non-solid edge. 

Case 1. This situation occurs when the letter a does not occur in w; one has
then f_y(new) = q0. Thus, after the instruction at line 8 the equality f[new] =
f_y(new) holds. For the other states r, one has f_w(r) = f_y(r) according to
Corollary 2.3.12, which gives the equalities f[r] = f_y(r) at the end of the
execution of the procedure EXTENSION.
Case 2. Let u be the longest word for which δ(q0, u) = p. By induction and
Lemma 2.3.6, we have |u| = lg_w(p) = L[p]. The word ua is the longest suffix of y
which is a factor of w. Thus, f_y(new) = q, which shows that f[new] = f_y(new)
after the instruction of line 11.

Since edge (p, a, q) is solid, by induction again, we have |ua| = L[q] = lg_w(q),
which shows that the words equivalent to ua according to ≡_w are not longer
than ua. Corollary 2.3.11 applies with z = ua. And as in Case 1, f[r] = f_y(r)
for all the states different from new.

Case 3. Let u be the longest word for which δ(q0, u) = p. The word ua
is the longest suffix of y which is a factor of w. So, f_y(new) = q, and thus
f[new] = f_y(new). Since edge (p, a, q) is not solid, ua is not the longest word in
its congruence class according to ≡_w. Theorem 2.3.10 applies with z = ua, and
z' the longest word for which δ(q0, z') = q. The class of ua according to ≡_w is
divided into two subclasses with respect to ≡_wa. They correspond to states q
and clone.

Words v no longer than ua and such that v ≡_w ua are of the form v'a with v' a
suffix of u (consequence of Lemma 2.3.1). Before the execution of the last loop,
all these words v satisfy q = δ_w(q0, v). Consequently, just after the execution of
the loop, they satisfy clone = δ_y(q0, v), as required by Theorem 2.3.10. Words
v longer than ua and such that v ≡_w ua satisfy q = δ_y(q0, v) after the execution
of the loop as required by Theorem 2.3.10 again. One can check that table f is
updated correctly.

For each of the three cases, one can check that the value of last is correctly 
computed at the end of the execution of the procedure EXTENSION. 

Finally, the induction shows that automaton M, state last, tables L and f
are correct after the execution of procedure EXTENSION.

It remains to be checked that terminal states are correctly marked during
the execution of the last loop of algorithm SUFFIXAUTOMATON. But this is
a direct consequence of Corollary 2.4.6 because variable p runs through the
suffix path of last. □


2.4.4. Complexity 


To analyze the complexity of the algorithm SUFFIXAUTOMATON we first describe
a possible implementation of the elements necessary for the construction.

We suppose that the automaton is represented by lists of successors. By
doing this, operations of addition, update, and access concerning an edge are
performed in time O(log Card A) with an efficient implementation of the lists.
Function f_y is realized by table f which gives access to f_y(p) in constant time.

To implement the solidity of edges table L is used. It represents the function
lg_y as the description of the procedure EXTENSION suggests (line 10). Another
way of doing it consists in using a Boolean value per edge of the automaton.
This induces a slight modification of the procedure which we describe as follows:
each first edge created during the execution of the loops at lines 4-6 and
lines 7-10 must be marked as solid; the other created edges are marked as
non-solid. This type of implementation does not require the use of table L which
can then be eliminated, reducing the memory space used. Nevertheless, table L
finds its utility in applications like those of Section 2.8. We retain that the
two types of implementation provide a constant-time access to the quality (solid
or not solid) of an edge.


THEOREM 2.4.8. Algorithm SUFFIXAUTOMATON can be implemented so that
the construction of A(y) takes time O(|y| × log Card A) in a memory space
O(|y|).


Proof. We choose an implementation by lists of successors for the transition
function. States of A(y) and tables f and L require a space O(st(y)), lists
of edges a space O(ed(y)). Thus, the complete implementation takes a space
O(|y|), as a consequence of Propositions 2.4.1 and 2.4.3.

Another consequence of these propositions is that all the operations carried
out either once per state or once per edge of the final automaton take a total
time O(|y| × log Card A). The same result applies to the operations which are
performed once per letter of y. It thus remains to be shown that the time spent
for the execution of the two loops at lines 4-6 and lines 7-10 of the procedure
EXTENSION is of the same order, namely O(|y| × log Card A).

We examine first the case of the first loop. Let us consider the execution
of the procedure EXTENSION during the transformation of A(w) into A(wa) (wa
is a prefix of y, a ∈ A). Let u be the longest word of state p during the test at
line 6. The initial value of u is s_w(w), and its final value satisfies ua = s_wa(wa)
(if p is defined). Let k = |w| − |u|, position of the suffix occurrence of u in w.
Then, each test strictly increases the value of k during a call to the procedure.
Moreover, the initial value of k at the beginning of the execution of the next
call is not smaller than its final value reached at the end of the execution of the
current call. So, k is never decreased and thus, tests and instructions of this
loop are done in O(|y|).

A similar argument applies to the second loop at lines 7-10 of the procedure
EXTENSION. Let v be the longest word of p during the test of the loop. The
initial value of v is s_w^j(w), for j ≥ 2, and its final value satisfies
va = s_wa^2(wa) (if p is defined). Then, the position of v as a suffix of w
increases strictly at each test during successive calls to the procedure. Thus,
again, tests and instructions of the loop are done in O(|y|) time.

Consequently, the cumulated time of the executions of the two loops is
O(|y| × log Card A), which finishes the proof. □


On a small alphabet, one can still choose an implementation of the automaton
that is even more efficient than that by lists of successors, to the detriment
of memory space however. It is enough to use a transition matrix within
O(|y| × Card A) memory space and to manage it like a sparse table. With this
particular management, any operation on edges is done in constant time, which
leads to the following result.


THEOREM 2.4.9. When the alphabet is fixed, algorithm SUFFIXAUTOMATON
can be implemented so that the construction of A(y) takes time O(|y|) in a
memory space O(|y| × Card A).


Proof. One can use, to implement the transition matrix, the technique for
representing sparse tables which gives a direct access to each one of its entries
while avoiding initializing the complete matrix. □


2.5. Compact suffix automaton 


In this section, we describe briefly how to build a compact suffix automaton
denoted by A^c(y) for y ∈ A*. This automaton can be seen as the compact
version of the suffix automaton of the preceding section, i.e., it is obtained
by removal of the states having only one outgoing transition and that are not
terminal. It is the process used on the suffix trie of Section 2.1 to produce a
structure of linear size.

The compact suffix automaton is also the minimized version, in the sense of 
automata theory, of the (compact) suffix tree of Section 2.2. It is obtained by 
identifying subtrees which recognize the same words. 

Figure 2.18 shows the compact suffix automaton of ababbb that can be compared
to the compact tree of Figure 2.5 and to the automaton of Figure 2.10.


Exactly as for the tree T(y), in the automaton A(y) we call fork any state
that is of (outgoing) degree at least 2, or that is both of degree 1 and
terminal. Forks of suffix automata satisfy the same property as forks of suffix
trees, a property which allows the compaction of the automaton. The proof of the
next proposition is an immediate adaptation of that of Proposition 2.2.3.


PROPOSITION 2.5.1. In the suffix automaton of a word, the suffix link of a
fork (different from the initial state) is a fork. □


When one removes non-fork states in A(y), edges of the automaton must
be labeled by (non-empty) words and not only by letters. To get a structure
of size linear in the length of y, labels of edges must not be stored explicitly.


Figure 2.18. The compact suffix automaton A^c(ababbb).

Figure 2.19. Representation of labels in the compact suffix automaton
A^c(ababbb). (To be compared with the automaton in Figure 2.18.)


One represents them in constant space by means of a couple of integers. If
the word u is the label of an edge (p, q), it is represented by the pair (i, |u|)
for which i is the position of an occurrence of u in y. We denote the label by
label(p, q) = (i, |u|) and suppose that the implementation of the automaton
provides a direct access to it. This forces one to store the word y together
with the data structure. Figure 2.19 indicates how labels of the compact suffix
automaton of ababbb are represented.
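
The pair representation can be illustrated by a few lines of Python. This is
only a sketch: the function name edge_label and the dictionary labels are
hypothetical, and the single entry shown stands for the edge (0, babbb, 1)
that appears during the construction of the automaton of ababbb.

```python
def edge_label(y, label):
    """Recover the word denoted by a label (i, length) relative to the text y."""
    i, length = label
    return y[i:i + length]


y = "ababbb"
# Hypothetical edge table: (source, target) -> (position in y, length).
labels = {(0, 1): (1, 5)}   # constant space per edge, whatever the label length

word = edge_label(y, labels[(0, 1)])   # the explicit label, rebuilt on demand
```

Each label costs two integers, so the whole structure stays linear in |y| as
long as y itself is kept in memory.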

The size of compact suffix automata evaluates rather directly from sizes of 
compact suffix trees and of suffix automata. 


PROPOSITION 2.5.2. Let y ∈ A* be a word of length n and let e_c(y) be the
number of states of A^c(y). For n = 0, we have e_c(y) = 1; for n > 0, we have

2 ≤ e_c(y) ≤ n + 1,

and the upper bound is reached for y = a^n, a ∈ A.


Proof. The result can be checked directly for the empty word.

Let us suppose n > 0. Let c be a letter, c ∉ A, and let us consider the
tree G(y·c). This tree has exactly n + 1 external nodes, and on each one of them
arrives an edge whose label ends by letter c. The tree has at most n internal
nodes because they have at least two outgoing edges. When minimized to get
a compact automaton, all external nodes are identified in only one state, which
reduces the number of states to n + 1 at most. Removal of letter c does not
increase this value, which gives the upper bound. It is immediate to check that
A^c(a^n) has n + 1 states exactly and that the obvious lower bound is reached
when the alphabet of y has size n. □


PROPOSITION 2.5.3. Let y ∈ A* be a word of length n and let f_c(y) be the
number of edges of A^c(y). For n = 0, we have f_c(y) = 0; for n = 1, we have
f_c(y) = 1; for n > 1, we have

f_c(y) ≤ 2(n − 1),

and the upper bound is reached for y = a^(n−1)b, where a, b are two distinct
letters.


Proof. After checking the results for the short words, one notes that if y is of
the form a^n, n > 1, one has f_c(y) = n − 1, quantity that is smaller than 2(n − 1).

Let us suppose now that Card alph(y) ≥ 2. We continue the proof of the
preceding lemma by still considering the word y·c, c ∉ A. Its suffix tree has at
most 2n nodes. Thus it has at most 2n − 1 edges, which after compaction gives
2n − 2 edges since the edges labeled by c disappear. This gives the announced
upper bound. The automaton A^c(a^(n−1)b) has n states and 2n − 2 edges, as can
be directly checked. □


The construction of A^c(y) can be carried out starting from the tree G(y)
or from the automaton A(y) (see exercises 2.5.1 and 2.5.2). However, to save
memory space at construction time one rather takes advantage of a direct
construction. It is the schema of this construction that is sketched here.

The construction borrows elements from the algorithms SUFFIXTREE and
SUFFIXAUTOMATON. Thus, the edges of the automaton are marked as solid or
not solid. The created edges targeted at new leaves of the tree become edges
to state last. We also use the concepts of slow and fast traversal from the
construction of suffix trees. The essential changes concern these two
procedures, to which are added duplications of states and redirections of edges,
as in the construction of suffix automata.

During the execution of a slow traversal, the attempt at crossing a non- 
solid edge leads to cloning its target, with a duplication similar to what is done 
during the execution of procedure EXTENSION at line 6. One can note that 
certain edges can be redirected by this process. 

The second important point in the adaptation of the algorithms of the
preceding sections relates to the fast traversal procedure. The main algorithm
calls it for the definition of a suffix link as in the algorithm SUFFIXTREE. The
difference comes when the target of a suffix link for a last-created fork (see
lines 8-11 in procedure FASTFIND) is created. If a new state has to be created
in the middle of a solid edge, the same process applies. But, if the edge is not
solid, during a first step the edge is only redirected towards the concerned
fork, and its label is updated accordingly. This leaves the suffix link
undefined and leads to an iteration of the same process.

The phenomena just described intervene in any sequential construction of this
type of automaton. Taking them into account is necessary for a correct
sequential computation of A^c(y). They are present in the construction of
A^c(ababbb) (see Figure 2.18) for which three stages are detailed in Figure 2.20.


To conclude the section, we state the complexity of the direct construction
of the compact suffix automaton. The formal description and the proof of the
algorithm are left to the reader.


PROPOSITION 2.5.4. The computation of the compact suffix automaton A^c(y)
can be done in time O(|y| × log Card A) in a space O(|y|). □
(b) Suffix link of state 2 is defined as state
(c) Duplication of state 2.


Figure 2.20. Three steps of the construction of A^c(ababbb). (a) Automaton
right after the insertion of the three longest suffixes of the word
ababbb. The suffix link of state 2 is still undefined. (b) Computation by
fast find of the suffix link of state 2, which results in transforming the
edge (0, babbb, 1) into (0, b, 2). At the same time, the suffix bbb is inserted.
(c) Insertion of the next suffix, bb, is done by slow find starting from
state 0. Since edge (0, b, 2) is not solid, its target, state 2, is duplicated
as 2' that has the same transitions as 2. To finish the insertion of suffix
bb it remains to cut the edge (2', bb, 1) to insert state 3. Finally, the rest
of the construction amounts to determining final states, and we get the
automaton of Figure 2.18.


2.6. Indexes 


Techniques introduced in the preceding sections find immediate applications to
the design of indexes on textual data. The usefulness of considering the
suffixes of a text for this kind of application comes from the obvious remark
that any factor of a word is a prefix of some suffix of the text. Using suffix
structures thus provides a kind of direct access to all the factors of a word or
a language, and it is certainly the main interest of these techniques. This
property gives rise to an implementation of an index on a text or a family of
texts, with efficient algorithms for the basic operations (Section 2.6.2) such
as questions of membership, location and computation of lists of occurrences of
patterns. Section 2.6.3 gives a solution in the form of a transducer.


2.6.1. Implementation of indexes 


The aim of an index is to provide an efficient mechanism for answering certain
questions concerning the contents of a fixed text. This word is denoted by y
(y ∈ A*) and its length is n (n ∈ N). An index on y can be regarded as
an abstract data type whose basic set is the set of factors of y, Fact(y), and
that includes operations for accessing information related to factors of y. The 
concept is similar to the index of a book which provides pointers to pages from 
a set of selected keywords. We rather consider what can be called a generalized 
index, in which all the factors of the text are present. We describe indexes for 
only one word, but extending methods to a finite number of words is in general 
a simple matter. 

We consider four principal operations on the index of a text. They are related
to a word x, the query, to be searched for in y: membership, position, number of
occurrences and list of positions. This set of operations is often extended in real
applications, in connection with the nature of data represented by y, to yield
information retrieval systems. But the four operations we consider constitute
the technical basis from which broader systems of queries can be developed.

For implementing indexes, we choose to treat the main method that leads 
to efficient and sometimes optimal algorithms. It is based on one of the data 
structures that represent suffixes of y and that are described in previous sections. 
The choice of the structure produces variations of the method. In this section 
we recall the elements of the data structures that must be available to execute 
the index operations. The operations themselves are treated in the next section. 

The implementation of an index is built on automata of the preceding
sections. Let us recall the data structures necessary to use the suffix tree,
G(y), of y. They are composed of:


• The word y itself stored in a table.


• An implementation of the automaton in the form of a transition matrix
or list of edges per state, to represent the transition function δ, the access
to the initial state, and a table of terminal states, for example.


• The table sℓ, defined on states, which represents the suffix link function
of the tree.


Note that the word y itself must be maintained in memory, for the labeling of
edges refers to it (see Section 2.2). The suffix link is useful for certain
applications only; it can of course be eliminated when the implemented
operations do not make use of it.
One can also consider the suffix automaton of y, A(y), which produces in a
natural way an index on factors of the text y. The structure includes:


• an implementation of the automaton as for the tree above,

• the table f that implements the failure function defined on states,

• the table L that indicates for each state the maximum length of the words
reaching this state.


For this automaton it is not necessary to store the word y in memory. It appears
in the automaton as the label of the longest path starting from the initial state.
Tables f and L can be omitted if not useful for the set of selected operations.

Lastly, the compact version of the suffix automaton can be used in order
to reduce even more the memory capacity necessary to store the structure.
Its implementation uses in a standard way the same elements as for the suffix
automaton (in non-compact version) with, in addition, the word y in order to
access labels of edges, as for the suffix tree. One gets a noticeable space
reduction in using this structure rather than the two preceding ones.

In the rest of the section we examine several types of solutions for realizing 
basic operations on indexes. 


2.6.2. Basic operations 


We consider in this section four operations related to factors of text y:
membership (in Fact(y)), first position, number of occurrences, and list of the
positions. The algorithms are presented after the global description of these
four operations.

The first operation on an index is the membership of word x to the index, 
i.e., the question of knowing if x is a factor of y. This question can be specified in 
two complementary ways according to whether one expects to find an occurrence 
of x in y or not. If x does not appear in y, it is often interesting in practice 
to find the longest prefix of x which is a factor of y. It is usually the type of 
response necessary to realize sequential searches in text editors. 


Membership Given x ∈ A*, find the longest prefix of x that belongs to
Fact(y).


In the contrary case (x ∈ Fact(y)), methods produce without much modification
the position of an occurrence of x, and even the position of the first or
last occurrence of x in y.


Position Given x a factor of y, find the (left) position of its first (respectively 
last) occurrence in y. 


Knowing that x is in the index, another relevant piece of information is
its number of occurrences in y. This information can drive later searches
differently.

Number of occurrences Given x a factor of y, find how many times x appears
in y.


Lastly, under the same assumption as before, complete information on the 
location of x in y is provided by the list of positions of its occurrences. 


List of positions Given x a factor of y, produce the list of positions of the 
occurrences of x in y. 


We describe solutions obtained by using the above data structures. It should
be noticed that the structures sometimes need to be enriched to guarantee an
efficient execution of the algorithms.


PROPOSITION 2.6.1. Given one of the automata G(y), A(y) or A^c(y), computing
the longest prefix u of x that is a factor of y can be carried out in time
O(|u| × log Card A) within memory space O(|y|).


Proof. By means of A(y), to determine the word u, it is enough to follow a path
labeled by a prefix of x starting from the initial state of the automaton. The
traversal stops when a transition is missing or when x is exhausted. This produces
the longest prefix of x which is also a prefix of the label of a path starting at
the initial state, i.e., which appears in y since all the factors of y are labels
of these paths. Overall, this is done after |u| successful transitions and possibly
one unsuccessful transition (when u is a proper prefix of x) at the end of the
test. As each transition takes a time O(log Card A) for an implementation in
space O(|y|) (by lists of successors), we obtain a total time O(|u| × log Card A).

The same process works with G(y) and A^c(y). Taking into account the
representation of these structures, certain transitions are done by simple letter
comparisons, but the maximum execution time is unchanged. □
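
The traversal used in this proof can be sketched in Python on the
(non-compact) suffix automaton. The code makes assumptions that are not in
the text: build_suffix_automaton is a compact transcription of the
construction of Section 2.4, included only to make the example
self-contained, and transitions are stored in dictionaries, so that a
transition takes expected constant time rather than the O(log Card A) of
lists of successors.

```python
def build_suffix_automaton(y):
    """Return (adj, L, f, terminal) for the suffix automaton of y (state 0 initial)."""
    adj, L, f, last = [{}], [0], [None], 0
    for a in y:
        new = len(adj)
        adj.append({}); L.append(L[last] + 1); f.append(None)
        p = last
        while p is not None and a not in adj[p]:
            adj[p][a] = new
            p = f[p]
        if p is None:
            f[new] = 0
        elif L[p] + 1 == L[adj[p][a]]:          # solid edge
            f[new] = adj[p][a]
        else:                                   # non-solid edge: clone the target
            q, clone = adj[p][a], len(adj)
            adj.append(dict(adj[q])); L.append(L[p] + 1); f.append(f[q])
            f[q] = f[new] = clone
            while p is not None and adj[p].get(a) == q:
                adj[p][a] = clone
                p = f[p]
        last = new
    terminal, p = set(), last
    while p is not None:
        terminal.add(p)
        p = f[p]
    return adj, L, f, terminal


def longest_prefix_factor(adj, x):
    """Longest prefix of x that is a factor of the indexed word."""
    p, length = 0, 0
    for a in x:
        if a not in adj[p]:       # missing transition: stop the traversal
            break
        p = adj[p][a]
        length += 1
    return x[:length]


adj, _, _, _ = build_suffix_automaton("aabbabb")
u = longest_prefix_factor(adj, "abab")   # stops after reading "ab"
```

The word x is a factor of y exactly when the returned prefix is x itself.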


Position 


We now examine the operations for which it is supposed that x is a factor of y.
The test of membership, which can be carried out separately as in the preceding
proposition, can also be integrated into the solutions of the other problems that
interest us here. The use of transducers, which extend suffix automata for this
type of question, is considered in the following section.

Finding the position pos_y(x) of the first occurrence of x in y amounts to
calculating its right position end-pos_y(x) (see Section 2.3) because

pos_y(x) = end-pos_y(x) − |x| + 1.

Moreover, this is also equivalent to computing the maximum length of right
contexts of x in y,

lc_y(x) = max{|z| | z ∈ R_y(x)},

because

pos_y(x) = |y| − lc_y(x) − |x|.

In a symmetrical way, in order to find the position last-pos_y(x) of the last
occurrence of x in y, it remains to calculate the minimal length sc_y(x) of its
right contexts because

last-pos_y(x) = |y| − sc_y(x) − |x|.


To be able to quickly answer requests related to the first or last positions
of factors of y, structures of index are not sufficient alone, at least if one seeks
to obtain optimal execution times. Consequently, one precomputes two tables
indexed by the states of the selected automaton and that represent functions
lc_y and sc_y. One thus obtains the following result.


PROPOSITION 2.6.2. Automata G(y), A(y) and A^c(y) can be preprocessed in
time O(|y|) so that the first (or last) position on y of a factor x of y, as well as
the number of occurrences of x, can be computed in time O(|x| × log Card A)
within memory space O(|y|).


Proof. Let us call M the selected structure, δ its transition function, F its set
of edges, and T its terminal states.

To begin let us consider the computation of pos_y(x). The preprocessing
of M relates to the computation of a table LC defined on states of M and
aimed at representing the function lc_y. For a state p and a word u ∈ A* with
p = δ(initial(M), u), we define

LC[p] = lc_y(u),

quantity that is independent of the word u that labels a path from the initial
state to p, according to Lemma 2.3.1. This value is also the maximum length
of paths starting at p and ending at a terminal state in the automaton A(y).
For G(y) and A^c(y) this consideration still applies by defining the length of an
edge as that of its label.

The table LC satisfies the recurrence relation:

LC[p] = 0                                             if deg(p) = 0,
LC[p] = max{ℓ + LC[q] | (p, v, q) ∈ F and |v| = ℓ}    otherwise.

The relation shows that the computation of values LC[p], for all the states of
M, is done by a simple depth-first traversal of the graph of the structure. As
its number of states and its number of edges are linear (see sections 2.2, 2.4
and 2.5) and since the access to the label length of an edge is done in constant
time according to the representation described in Section 2.2, the computation
of the table takes a time O(|y|) (independent of the alphabet).

Once the precomputation of the table LC is performed, the computation of
pos_y(x) is done by searching for p = δ(initial(M), x) and then by computing
|y| − LC[p] − |x|. We then obtain the same asymptotic execution time as for the
membership problem, namely O(|x| × log Card A). Let us note that if

end(initial(M), x) = δ(initial(M), xw)

with w non-empty, the value of pos_y(x) is then |y| − LC[p] − |xw|, which does
not modify the asymptotic evaluation of the execution time.

The computation of the position of the last occurrence of x in y is solved in
a similar way by considering the table SC defined by

SC[p] = sc_y(u),

with the notations above. The relation

SC[p] = 0                                             if p ∈ T,
SC[p] = min{ℓ + SC[q] | (p, v, q) ∈ F and |v| = ℓ}    otherwise,

shows that the precomputation of SC takes a time O(|y|), and that the
computation of last-pos_y(x) takes then O(|x| × log Card A) time.

Lastly, for accessing the number of occurrences of x one precomputes a table
NB defined by

NB[p] = Card{z ∈ A* | δ(p, z) ∈ T},

which is precisely the sought quantity when p = end(initial(M), x). The linear
precomputation results from the relation

NB[p] = 1 + Σ_{(p,v,q) ∈ F} NB[q]    if p ∈ T,
NB[p] = Σ_{(p,v,q) ∈ F} NB[q]        otherwise.

Then, the number of occurrences of x is obtained by computing the state p =
end(initial(M), x) and by accessing NB[p], which is done in the same time
as for the above operations.

This ends the proof. □
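
The three precomputed tables can be sketched in Python for the non-compact
suffix automaton, where every edge has length 1. Several choices below are
assumptions made for the example and not the method of the text: the helper
build_suffix_automaton transcribes the construction of Section 2.4, the names
index_tables and make_index are illustrative, and the states are processed by
decreasing value of L (a reverse topological order, since every edge goes to
a state with a larger L) instead of the depth-first traversal mentioned in
the proof.

```python
def build_suffix_automaton(y):
    """Return (adj, L, f, terminal) for the suffix automaton of y (state 0 initial)."""
    adj, L, f, last = [{}], [0], [None], 0
    for a in y:
        new = len(adj)
        adj.append({}); L.append(L[last] + 1); f.append(None)
        p = last
        while p is not None and a not in adj[p]:
            adj[p][a] = new
            p = f[p]
        if p is None:
            f[new] = 0
        elif L[p] + 1 == L[adj[p][a]]:          # solid edge
            f[new] = adj[p][a]
        else:                                   # non-solid edge: clone the target
            q, clone = adj[p][a], len(adj)
            adj.append(dict(adj[q])); L.append(L[p] + 1); f.append(f[q])
            f[q] = f[new] = clone
            while p is not None and adj[p].get(a) == q:
                adj[p][a] = clone
                p = f[p]
        last = new
    terminal, p = set(), last
    while p is not None:
        terminal.add(p)
        p = f[p]
    return adj, L, f, terminal


def index_tables(adj, L, terminal):
    """Tables LC, SC and NB, following the three recurrence relations."""
    n = len(adj)
    LC, SC, NB = [0] * n, [0] * n, [0] * n
    for p in sorted(range(n), key=lambda s: -L[s]):   # targets before sources
        targets = list(adj[p].values())
        LC[p] = max((1 + LC[q] for q in targets), default=0)
        # A non-terminal state always has an outgoing edge, so min() is safe.
        SC[p] = 0 if p in terminal else min(1 + SC[q] for q in targets)
        NB[p] = (1 if p in terminal else 0) + sum(NB[q] for q in targets)
    return LC, SC, NB


def make_index(y):
    """Return a query function x -> (first position, last position, count)."""
    adj, L, _, terminal = build_suffix_automaton(y)
    LC, SC, NB = index_tables(adj, L, terminal)

    def query(x):
        p = 0
        for a in x:
            if a not in adj[p]:
                return None                     # x is not a factor of y
            p = adj[p][a]
        return len(y) - LC[p] - len(x), len(y) - SC[p] - len(x), NB[p]

    return query


query = make_index("aabbabb")
first, last, count = query("abb")
```

After the linear preprocessing, each query costs one traversal of the
automaton along x followed by three table accesses.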


An argument similar to the last element of the preceding proof allows an
effective computation of the number of factors of y, i.e., of the size of Fact(y).
For that, one evaluates the quantity CS[p], for all states p of the automaton, by
using the relation

CS[p] = 1                                          if deg(p) = 0,
CS[p] = 1 + Σ_{(p,v,q) ∈ F} (|v| − 1 + CS[q])      otherwise.

If p = δ(initial(M), u) for some factor u of y, CS[p] is the number of factors
of y starting with u. This gives a linear-time computation of Card Fact(y) =
CS[q0] (q0 initial state of the automaton), i.e., in time O(|y|) independent of the
alphabet, given the automaton.
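
On the non-compact suffix automaton every label has length 1, so the relation
reduces to CS[p] = 1 + Σ CS[q] over the targets q of the edges leaving p. The
Python sketch below relies on that special case. It is an illustration under
assumptions: build_suffix_automaton is a transcription of the construction of
Section 2.4 added to keep the block self-contained, count_factors is a
hypothetical name, and the states are handled by decreasing L rather than by
an explicit graph traversal.

```python
def build_suffix_automaton(y):
    """Return (adj, L, f, terminal) for the suffix automaton of y (state 0 initial)."""
    adj, L, f, last = [{}], [0], [None], 0
    for a in y:
        new = len(adj)
        adj.append({}); L.append(L[last] + 1); f.append(None)
        p = last
        while p is not None and a not in adj[p]:
            adj[p][a] = new
            p = f[p]
        if p is None:
            f[new] = 0
        elif L[p] + 1 == L[adj[p][a]]:          # solid edge
            f[new] = adj[p][a]
        else:                                   # non-solid edge: clone the target
            q, clone = adj[p][a], len(adj)
            adj.append(dict(adj[q])); L.append(L[p] + 1); f.append(f[q])
            f[q] = f[new] = clone
            while p is not None and adj[p].get(a) == q:
                adj[p][a] = clone
                p = f[p]
        last = new
    terminal, p = set(), last
    while p is not None:
        terminal.add(p)
        p = f[p]
    return adj, L, f, terminal


def count_factors(y):
    """Card Fact(y), the empty word included, computed as CS[initial state]."""
    adj, L, _, _ = build_suffix_automaton(y)
    CS = [1] * len(adj)
    for p in sorted(range(len(adj)), key=lambda s: -L[s]):   # targets first
        CS[p] = 1 + sum(CS[q] for q in adj[p].values())
    return CS[0]


number_of_factors = count_factors("aabbabb")
```

The count includes the empty word, which corresponds to the term 1 contributed
by the initial state itself.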


List of positions 


PROPOSITION 2.6.3. Given the tree G(y) or the automaton A^c(y), the list
L of positions of the occurrences of a factor x of y can be computed in time
O(|x| × log Card A + k) within memory space O(|y|), where k is the number of
elements in L.
Proof. The tree G(y) is first considered. Let us recall from Section 2.1
that a state q of the tree is a factor of y, and that, if it is terminal, its output
is the position of the suffix occurrence of q in y (in this case q is a suffix of
y and output[q] = |y| − |q|). The positions of occurrences of x in y are the
positions of suffixes prefixed by x. One thus obtains these positions by seeking
terminal states of the subtree rooted at p = end(initial(M), x) (see section 2.2).
Exploration of this subtree takes a time proportional to its size and indeed to
its number of terminal nodes since each node that is not terminal has at least
two children by definition of the tree. Finally, the number of terminal nodes is
precisely the number k of elements of the list L.

In short, the computation of the list requires that of p and then the traversal
of the subtree. The first phase is carried out in time O(|x| × log Card A), the
second in time O(k), which gives the announced result when G(y) is used.

A similar reasoning applies to A^c(y). Let p = end(initial(M), x) and let w
be such that δ(initial(M), xw) = p. Starting from p, we explore the automaton
by memorizing the length of the current path (the length of an edge is that of its
label). A terminal state q that is reached by a path of length ℓ corresponds to a
suffix of length ℓ which therefore occurs at position |y| − ℓ. Then, |y| − ℓ − |xw|
is the position of an occurrence of x in y. The complete traversal takes a time
O(k) as its equivalent traversal of the subtree of G(y) described above. We thus
obtain the same running time as with the compact suffix tree. □
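
The exploration with memorized path lengths can be sketched in Python. For
brevity the sketch runs on the non-compact suffix automaton, so the word w of
the proof is empty and, as an acknowledged limitation, the traversal does not
achieve the O(k) bound of the proposition. The helper build_suffix_automaton
and the name occurrences are assumptions introduced for the example.

```python
def build_suffix_automaton(y):
    """Return (adj, L, f, terminal) for the suffix automaton of y (state 0 initial)."""
    adj, L, f, last = [{}], [0], [None], 0
    for a in y:
        new = len(adj)
        adj.append({}); L.append(L[last] + 1); f.append(None)
        p = last
        while p is not None and a not in adj[p]:
            adj[p][a] = new
            p = f[p]
        if p is None:
            f[new] = 0
        elif L[p] + 1 == L[adj[p][a]]:          # solid edge
            f[new] = adj[p][a]
        else:                                   # non-solid edge: clone the target
            q, clone = adj[p][a], len(adj)
            adj.append(dict(adj[q])); L.append(L[p] + 1); f.append(f[q])
            f[q] = f[new] = clone
            while p is not None and adj[p].get(a) == q:
                adj[p][a] = clone
                p = f[p]
        last = new
    terminal, p = set(), last
    while p is not None:
        terminal.add(p)
        p = f[p]
    return adj, L, f, terminal


def occurrences(y, x):
    """Sorted list of the positions of the occurrences of x in y."""
    adj, _, _, terminal = build_suffix_automaton(y)
    p = 0
    for a in x:
        if a not in adj[p]:
            return []                           # x is not a factor of y
        p = adj[p][a]
    positions = []
    stack = [(p, 0)]                            # (state, length of the path from p)
    while stack:
        q, length = stack.pop()
        if q in terminal:                       # a suffix of length |x| + length
            positions.append(len(y) - length - len(x))
        for r in adj[q].values():
            stack.append((r, length + 1))
    return sorted(positions)


positions_of_abb = occurrences("aabbabb", "abb")
```

Each path from p to a terminal state spells a distinct right context of x, so
every position is reported exactly once.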


Notice that the computation of the lists of positions is obtained without
preprocessing the automata. By the way, using the (non-compact) suffix automaton
of y requires a preprocessing which consists in creating shortcuts to superimpose
the structure of A^c(y) on it, if one wishes to obtain the same running time.


2.6.3. Transducer of positions 


Some of the questions of locating factors within the word y can be described in
terms of transducers, i.e., automata in which edges have an output in addition
to outputs on states. As an example, the function pos_y is realized by the
transducer of positions of y, denoted by T(y). Figure 2.21 gives an illustration
of it.

The transducer $T(y)$ is built upon $\mathcal{A}(y)$ by adding outputs to edges and by
modifying the outputs associated with the terminal states. Edges of $T(y)$ are
of the form $(p,(a,s),q)$ where $p$, $q$ are states and $(a,s)$ is the label of the edge.
Letter $a \in A$ is its input and integer $s \in \mathbb{N}$ is its output. The path

$$(p_0,(a_0,s_0),p_1),\ (p_1,(a_1,s_1),p_2),\ \ldots,\ (p_{k-1},(a_{k-1},s_{k-1}),p_k)$$

of the transducer has as input label the word $a_0 a_1 \cdots a_{k-1}$, concatenation of
input labels of edges of the path, and for output the sum $s_0 + s_1 + \cdots + s_{k-1}$.

The transformation of $\mathcal{A}(y)$ into $T(y)$ is done as follows. When $(p,a,q)$ is
an edge of $\mathcal{A}(y)$, it becomes the edge $(p,(a,s),q)$ of $T(y)$ with output

$$s = \text{end-pos}_y(q) - \text{end-pos}_y(p) - 1,$$

which is also

$$LC[p] - LC[q] - 1$$

with the notation $LC$ used in the proof of Proposition 2.6.2. The output associated
with a terminal state $p$ is defined as $LC[p]$. The proof of Proposition 2.6.2 shows
how to compute the table $LC$, from which one deduces a computation of the
outputs associated with edges and terminal states. The transformation
is thus carried out in linear time.


Figure 2.21. Transducer that realizes in a sequential way the function
$\mathit{pos}_y$ relative to $y$ = aabbabb. Each edge is labeled by a pair $(a, s)$ denoted
by a:s, where $a$ is the input of the edge and $s$ its output. When scanning
abb, the transducer produces the value 1 (= 0 + 1 + 0), which is the position
of the first occurrence of abb in $y$. The last state having output 3, one
deduces that abb is a suffix at position 4 (= 1 + 3) of $y$.


PROPOSITION 2.6.4. Let $u$ be the input label of a path starting at the initial
state of the transducer $T(y)$. Then, the output of the path is $\mathit{pos}_y(u)$.
Moreover, if the end of the path is a terminal state having output $t$, $u$ is a
suffix of $y$ and the position of this occurrence of $u$ in $y$ is $\mathit{pos}_y(u) + t$ $(= |y| - |u|)$.


Proof. We prove it by induction on the length of $u$. The base case,
$u = \varepsilon$, is immediate. Let us suppose that $u = va$ with $v \in A^*$
and $a \in A$. The output of the path having input label $va$ is $r + s$, where $r$ and $s$
are respectively the outputs corresponding to inputs $v$ and $a$. By the induction
hypothesis, we have $r = \mathit{pos}_y(v)$. By definition of labels in $T(y)$, we have

$$s = \text{end-pos}_y(u) - \text{end-pos}_y(v) - 1.$$

Therefore the output associated with $u$ is

$$\mathit{pos}_y(v) + \text{end-pos}_y(u) - \text{end-pos}_y(v) - 1,$$

or also, since $\text{end-pos}_y(w) = \mathit{pos}_y(w) + |w| - 1$,

$$\mathit{pos}_y(u) + |u| - |v| - 1,$$




which is $\mathit{pos}_y(u)$ as expected, because $|u| = |v| + 1$. This finishes the proof of the first part of the
statement.

If the end of the considered path is a terminal state, its output $t$ is, by
definition, $LC[u]$, which is $|y| - \text{end-pos}_y(u) - 1$ or $|y| - \mathit{pos}_y(u) - |u|$. Therefore
$\mathit{pos}_y(u) + t = |y| - |u|$, which is the position in $y$ of the suffix $u$ as announced. □


The existence of the transducer of positions described above shows that the 
position of a factor in y can be computed sequentially, while reading the factor. 
The computation is even done in real time when transitions are performed in 
constant time. 
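
The construction of the transducer of positions can be sketched in Python as follows. This is a minimal illustration rather than the chapter's own code: the function names (`build_suffix_automaton`, `position_transducer`, `first_position`) and the array layout are assumptions made here, the suffix automaton is built by the standard on-line algorithm, and the table `first_end` is assumed to play the role of $\text{end-pos}_y$ (with value $-1$ on the initial state).

```python
def build_suffix_automaton(y):
    """Standard on-line construction (hypothetical helper, not from the text).
    Returns transitions, suffix links, lengths, first end positions, last state."""
    nxt, link, length, first_end = [{}], [-1], [0], [-1]
    last = 0
    for i, c in enumerate(y):
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        first_end.append(i)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's edges and first end position.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                first_end.append(first_end[q])
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length, first_end, last


def position_transducer(y):
    """Attach outputs to edges and to terminal states."""
    nxt, link, _, first_end, last = build_suffix_automaton(y)
    # Edge (p, a, q) gets output end-pos(q) - end-pos(p) - 1.
    out = [{a: first_end[q] - first_end[p] - 1 for a, q in edges.items()}
           for p, edges in enumerate(nxt)]
    # Terminal states are on the suffix-link chain of `last`; output is LC[p].
    terminal = {}
    p = last
    while p != -1:
        terminal[p] = len(y) - first_end[p] - 1
        p = link[p]
    return nxt, out, terminal


def first_position(y, u):
    """Return (pos_y(u), t); t is None when u is not a suffix of y,
    and both values are None when u is not a factor of y."""
    nxt, out, terminal = position_transducer(y)
    p, total = 0, 0
    for a in u:
        if a not in nxt[p]:
            return None, None
        total += out[p][a]
        p = nxt[p][a]
    return total, terminal.get(p)
```

For instance, `first_position("aabbabb", "abb")` sums the outputs 0, 1, 0 along the path and reads the terminal output 3, in agreement with the caption of Figure 2.21. The sketch rebuilds the transducer on each call for brevity; a real index would build it once.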


2.7. Finding regularities 


2.7.1. Repetitions 


In this section we examine two questions concerning repetitions of factors within
the text $y$. There are two dual problems that are solved efficiently by using a
suffix tree or suffix automaton:


• Compute longest repeated factors of $y$.
• Find shortest factors having few occurrences in $y$.


These questions are parameterized by an integer $k$ which bounds the number of
occurrences.


Longest repetition Given an integer k, k > 1, find a longest word occurring 
at least k times in y. 


Let $\mathcal{A}(y)$ be the suffix automaton of $y$. If the table $NB$ defined in the proof
of Proposition 2.6.2 is available, the problem of the longest repetition reduces
to finding the states $p$ of $\mathcal{A}(y)$ which are the deepest in the automaton and for
which $NB[p] \ge k$. The labels of the longest paths from the initial state to such states $p$ are
then solutions of the problem.

In fact, the problem can be solved without the table $NB$, because its values
do not need to be stored. We show how this is done for the instance of
the problem with $k = 2$. One simply seeks a state $p$ (or all such states), as deep as
possible, that satisfies one of the two conditions:


• at least two edges leave $p$,


• an edge leaves $p$ and $p$ is a terminal state.


State $p$ is then a fork and it is found by a mere traversal of the automaton.
Proceeding in this way, no preliminary treatment of $\mathcal{A}(y)$ is necessary and
nevertheless the linear computing time is preserved. One can note that the execution
time does not depend on the branching time in the automaton because no
transition is executed; the search only traverses existing edges.

The two descriptions above are summarized in the following proposition. 
PROPOSITION 2.7.1. Given one of the automata $G(y)$, $\mathcal{A}(y)$ or $\mathcal{A}^c(y)$, computing
a longest repeated factor of $y$ can be done in time and space $O(|y|)$. □
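
The fork criterion for $k = 2$ can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the chapter's code: the helper `build_suffix_automaton` is the standard on-line construction, the names `length` and `first_end` are hypothetical stand-ins for the length table and the first end position of a state, and the states are simply scanned in a loop instead of being reached by a traversal.

```python
def build_suffix_automaton(y):
    """Standard on-line construction (hypothetical helper, not from the text)."""
    nxt, link, length, first_end = [{}], [-1], [0], [-1]
    last = 0
    for i, c in enumerate(y):
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        first_end.append(i)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                first_end.append(first_end[q])
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length, first_end, last


def longest_repeated_factor(y):
    """Return a longest word occurring at least twice in y (case k = 2)."""
    nxt, link, length, first_end, last = build_suffix_automaton(y)
    # Terminal states are exactly those on the suffix-link chain of `last`.
    terminal = set()
    p = last
    while p != -1:
        terminal.add(p)
        p = link[p]
    best = 0  # the initial state stands for the empty word
    for p, edges in enumerate(nxt):
        is_fork = len(edges) >= 2 or (len(edges) >= 1 and p in terminal)
        if is_fork and length[p] > length[best]:
            best = p
    end = first_end[best]
    # The longest word reaching `best` ends at `end` and has length[best] letters.
    return y[end - length[best] + 1:end + 1]
```

On the running example, `longest_repeated_factor("aabbabb")` returns `abb`, which occurs at positions 1 and 4.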


The second problem deals with searching for a marker. A factor of $y$ is so
called when it marks a small number of positions on $y$.


Marker Given an integer $k$, $k > 1$, find a shortest word having fewer than $k$
occurrences in $y$.


The use of a suffix automaton provides a solution to this problem in the same
vein as the solution to the longest repetition problem. It amounts to
finding, in the automaton, a state that is as close as possible to the initial state
and that is the origin of fewer than $k$ paths to a terminal state. Contrary to the
above situation however, a state associated with a marker is not necessarily a
fork, but this has no effect on the solution. Again, a simple traversal of the
automaton solves the question, which gives the following result.


PROPOSITION 2.7.2. Given one of the automata $G(y)$, $\mathcal{A}(y)$ or $\mathcal{A}^c(y)$, the
computation of a marker in $y$ can be carried out in time and space $O(|y|)$. □
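
A possible Python rendering of the marker search is given below, as a sketch only. It assumes the standard on-line suffix automaton construction (helper `build_suffix_automaton`, a name chosen here), counts for each state the number of paths to a terminal state, and then performs a breadth-first traversal from the initial state. Words are carried along in the queue for readability, which is not the most economical choice.

```python
from collections import deque


def build_suffix_automaton(y):
    """Standard on-line construction (hypothetical helper, not from the text)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in y:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length, last


def shortest_marker(y, k):
    """Return a shortest factor of y having fewer than k occurrences (k > 1)."""
    nxt, link, length, last = build_suffix_automaton(y)
    terminal = set()
    p = last
    while p != -1:
        terminal.add(p)
        p = link[p]
    # paths[p] = number of paths from p to a terminal state, that is, the
    # number of occurrences of the words reaching p. Every edge goes to a
    # state of larger length, so decreasing length is a valid order.
    paths = [0] * len(nxt)
    for p in sorted(range(len(nxt)), key=lambda s: -length[s]):
        paths[p] = (p in terminal) + sum(paths[q] for q in nxt[p].values())
    # Breadth-first traversal: the first state found is closest to the start.
    queue, seen = deque([(0, "")]), {0}
    while queue:
        p, u = queue.popleft()
        if paths[p] < k:
            return u
        for a, q in nxt[p].items():
            if q not in seen:
                seen.add(q)
                queue.append((q, u + a))
    return None
```

With $y$ = aabbabb, the letters a and b occur 3 and 4 times, so `shortest_marker(y, 4)` returns `a`, while for $k = 2$ a marker has length 2 (aa and ba both occur once).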


2.7.2. Forbidden words 


Searching for forbidden words is a reverse question to finding repetitions. It
arises in the description of a certain type of text compression algorithm.

A word $u \in A^*$ is called a forbidden word in the word $y \in A^*$ if it is not
a factor of $y$. And $u$ is called a minimal forbidden word if, in addition, all its
proper factors are factors of $y$. In other words, the minimality relates to
the ordering “is a factor of”. This concept is in fact more relevant than the
preceding one. We denote by $I(y)$ the set of minimal forbidden words in $y$.

One can notice that

$$u = u[0..k-1] \in I(y)$$

if and only if

$u$ is not a factor of $y$ but $u[0..k-2]$ and $u[1..k-1]$ are factors of $y$,

which results in the equality

$$I(y) = (A \cdot \mathrm{Fact}(y)) \cap (\mathrm{Fact}(y) \cdot A) \cap (A^* \setminus \mathrm{Fact}(y)).$$


The equality shows in particular that the language I(y) is finite. It is thus 
possible to represent I(y) by a trie in which the external nodes only are terminal 
because of the minimality of words. 

The algorithm FORBIDDENWORDS, whose code is given below, builds the trie
accepting $I(y)$ from the automaton $\mathcal{A}(y)$. Figure 2.22 shows the example of the
trie of forbidden words of aabbabb, obtained from the automaton of Figure 2.13.
In the algorithm, the queue is used to traverse the automaton $\mathcal{A}(y)$ in a
breadth-first manner.


Figure 2.22. Trie of minimal forbidden words of the word aabbabb on
the alphabet {a, b, c}, as built by algorithm FORBIDDENWORDS.
Non-terminal states are those of automaton $\mathcal{A}$(aabbabb) of Figure 2.13. Note
that states 3 and 4 as well as the edges reaching them can be removed.
The forbidden word babba, recognized by the tree, is minimal because babb
and abba are factors of aabbabb.


FORBIDDENWORDS(A(y))
 1  M ← NEWAUTOMATON()
 2  L ← EMPTYQUEUE()
 3  ENQUEUE(L, (initial(A(y)), initial(M)))
 4  while not QUEUEISEMPTY(L) do
 5      (p, p') ← DEQUEUE(L)
 6      for each a ∈ A do
 7          if TARGET(p, a) is undefined and
                (p = initial(A(y)) or TARGET(f[p], a) is defined) then
 8              q' ← NEWSTATE()
 9              terminal(q') ← TRUE
10              adj[p'] ← adj[p'] ∪ {(a, q')}
11          elseif TARGET(p, a) is defined and
                TARGET(p, a) not reached yet then
12              q' ← NEWSTATE()
13              adj[p'] ← adj[p'] ∪ {(a, q')}
14              ENQUEUE(L, (TARGET(p, a), q'))
15  return M


PROPOSITION 2.7.3. For $y \in A^*$, the algorithm FORBIDDENWORDS produces,
from the automaton $\mathcal{A}(y)$, a tree that accepts the language $I(y)$. The execution
can be done in time $O(|y| \times \log \mathrm{Card}\,A)$.


Proof. Edges created at line 13 duplicate the edges of the
spanning tree of shortest paths of the graph of $\mathcal{A}(y)$, because the automaton is
traversed in breadth-first order (this is the purpose of the queue $L$). Other edges are
created at line 10 and are of the form $(p', a, q')$ with $p' \notin T'$ and $q' \in T'$, denoting by
$T'$ the set of terminal states of $M$. Let us denote by $\delta'$ the transition function
associated with the edges of $M$ created by the algorithm. By construction, the
word $u$ for which $\delta'(\mathrm{initial}(M), u) = p'$ is the shortest word that reaches the
state $p = \delta(\mathrm{initial}(\mathcal{A}(y)), u)$ in $\mathcal{A}(y)$.

We start by showing that any word recognized by the tree that the algorithm
produces is a minimal forbidden word. Let $ua$ be such a word, necessarily
nonempty ($u \in A^*$, $a \in A$). By assumption, the edge $(p', a, q')$ was created at
line 10 and $q' \in T'$. If $u = \varepsilon$, we have $p' = \mathrm{initial}(M)$ and we notice that, by
construction, $a \notin \mathrm{alph}(y)$; therefore $ua$ is indeed a minimal forbidden word.
If $u \ne \varepsilon$, let us write it $bv$ with $b \in A$ and $v \in A^*$. The state

$$s = \delta(\mathrm{initial}(\mathcal{A}(y)), v)$$

satisfies $s \ne p$ because $|v| < |u|$ and, by construction, $u$ is the shortest
word that satisfies $p = \delta(\mathrm{initial}(\mathcal{A}(y)), u)$. Therefore $f[p] = s$, by definition of
suffix links. Then, again by construction, $\delta(s, a)$ is defined, which implies that
$va$ is a factor of $y$. The word $ua = bva$ is thus minimal forbidden since $bv$ and $va$
are factors of $y$ but $ua$ is not a factor of $y$.

It is then shown conversely that any minimal forbidden word is recognized by the tree
built by the algorithm. Let $ua$ be such a word, necessarily nonempty ($u \in \mathrm{Fact}(y)$,
$a \in A$). If $u = \varepsilon$, the letter $a$ does not appear in $y$, and thus $\delta(\mathrm{initial}(\mathcal{A}(y)), a)$
is not defined. The condition at line 7 is met and causes the creation of an edge which
leads to the recognition of the word $ua$ by the automaton $M$. If $u \ne \varepsilon$, let us
write it $bv$ with $b \in A$ and $v \in A^*$. Let

$$p = \delta(\mathrm{initial}(\mathcal{A}(y)), u).$$

As $v$ is a proper suffix of $u$ and $va$ is a factor of $y$ while $ua$ is not a factor of $y$,
if we consider the state

$$s = \delta(\mathrm{initial}(\mathcal{A}(y)), v),$$

we have necessarily $p \ne s$ and thus $s = f[p]$ by definition of suffix links. The
condition at line 7 is thus still satisfied in this case, and this has the same effect
as above. In conclusion, $ua$ is recognized by the tree created by the algorithm,
which finishes the proof. □
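
The algorithm FORBIDDENWORDS can be transcribed in Python roughly as follows. This is a sketch under stated assumptions: the suffix automaton is built with the standard on-line algorithm (helper `build_suffix_automaton`, a name chosen here), and, instead of building the trie $M$, the function collects the accepted words in a set, carrying with each state the shortest word that reaches it.

```python
from collections import deque


def build_suffix_automaton(y):
    """Standard on-line construction (hypothetical helper, not from the text)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in y:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link


def minimal_forbidden_words(y, alphabet):
    """Return the set I(y) of minimal forbidden words of y over `alphabet`."""
    nxt, link = build_suffix_automaton(y)
    result = set()
    reached = {0}
    queue = deque([(0, "")])  # pairs (state, shortest word reaching it)
    while queue:
        p, u = queue.popleft()
        for a in alphabet:
            q = nxt[p].get(a)
            if q is None:
                # Line 7 of the pseudocode: ua is not a factor of y, and either
                # u is empty or va is a factor, v being u minus its first letter.
                if p == 0 or a in nxt[link[p]]:
                    result.add(u + a)
            elif q not in reached:
                # Lines 11-14: follow the breadth-first spanning tree.
                reached.add(q)
                queue.append((q, u + a))
    return result
```

For the word aabbabb on the alphabet {a, b, c}, the function returns the six words c, aaa, aba, baa, bbb and babba; the last one is the word singled out in the caption of Figure 2.22.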


An unexpected consequence of the preceding construction is an upper bound 
on the number of minimal forbidden words in a word. 


PROPOSITION 2.7.4. A word $y \in A^*$ of length $|y| \ge 2$ has no more than
$\mathrm{Card}\,A + (2|y| - 3) \times (\mathrm{Card}\,\mathrm{alph}(y) - 1)$ minimal forbidden words. It has $\mathrm{Card}\,A$
of them if $|y| < 2$.


Proof. According to the preceding proposition the number of minimal forbidden
words in $y$ is equal to the number of terminal states of the trie accepting $I(y)$, which is
also the number of edges entering these states.

There are exactly $\mathrm{Card}\,A - \alpha$ such edges coming out from the initial state,
where $\alpha = \mathrm{Card}\,\mathrm{alph}(y)$. There are at most $\alpha$ such edges from the unique
state of $\mathcal{A}(y)$ having no outgoing transition. From other states there are at most
$\alpha - 1$ such edges. Since, for $|y| \ge 2$, $\mathcal{A}(y)$ has at most $2|y| - 1$ states
(Proposition 2.4.1), we obtain

$$\mathrm{Card}\,I(y) \le (\mathrm{Card}\,A - \alpha) + \alpha + (2|y| - 3) \times (\alpha - 1),$$

which gives

$$\mathrm{Card}\,I(y) \le \mathrm{Card}\,A + (2|y| - 3) \times (\alpha - 1),$$

as announced.

Finally, we have $I(\varepsilon) = A$ and, for $a \in A$, $I(a) = (A \setminus \{a\}) \cup \{aa\}$. Therefore
$\mathrm{Card}\,I(y) = \mathrm{Card}\,A$ when $|y| < 2$. □
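
As a quick numerical check of the bound, one can take the running example $y$ = aabbabb on the alphabet $A = \{a, b, c\}$, so that $|y| = 7$, $\mathrm{Card}\,A = 3$ and $\mathrm{Card}\,\mathrm{alph}(y) = 2$. The list of six minimal forbidden words below was enumerated by hand from the definition for this illustration; it is not taken from the original text.

```latex
\[
\mathrm{Card}\, I(y) \;\le\; \mathrm{Card}\, A + (2|y| - 3)\,(\mathrm{Card}\,\mathrm{alph}(y) - 1)
  \;=\; 3 + 11 \times 1 \;=\; 14,
\]
\[
I(aabbabb) = \{\, c,\ aaa,\ aba,\ baa,\ bbb,\ babba \,\},
\qquad \mathrm{Card}\, I(aabbabb) = 6 \le 14 .
\]
```

The bound is therefore far from tight on this example.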


2.8. Pattern matching machine 


Suffix automata can be used as machines to locate occurrences of patterns.
We consider in this section the suffix automaton $\mathcal{A}(x)$ to implement the search
for $x$ (length $m$) in a word $y$ (length $n$). The other structures, compact tree
$G(x)$ and compact automaton $\mathcal{A}^c(x)$, can be used as well.

The searching algorithm rests on considering a transducer with a failure
function. The transducer computes sequentially the lengths $\ell_i$ defined below.
It is built upon the automaton $\mathcal{A}(x)$, and the failure function, used to cope
with non-explicitly defined transitions of the searching automaton, is nothing
else than the suffix link $f$ defined on states of the automaton. The principle of
the searching method is standard. The search is carried out sequentially along
the word $y$. Adaptation and analysis of the algorithm with the tree $G(x)$ are
immediate although the suffix link function of this structure is not a failure
function according to the precise meaning of this concept (see Exercise 2.2.4).

The advantage of this algorithm over other methods based on failure
functions lies in a bounded amount of time to treat a letter of $y$, together with
a more direct analysis of its time complexity. The price for this improvement
is a larger memory requirement, since the automaton must be stored
instead of a simple table, although the space remains linear.


2.8.1. Lengths of common factors 


The search for $x$ is based on computing lengths of factors of $x$ appearing at any
position on $y$. More precisely, the algorithm computes, at any position $i$ on $y$,
$0 \le i < n$, the length

$$\ell_i = \max\{|u| : u \in \mathrm{Fact}(x) \cap \mathrm{Suff}(y[0..i])\}$$

of the longest factor of $x$ ending at this position. The detection of occurrences
of $x$ follows the obvious remark:


$x$ occurs at position $i - |x| + 1$ on $y$
if and only if
$\ell_i = |x|$.


Figure 2.23. Using the automaton $\mathcal{A}$(aabbabb) (see Figure 2.13), algorithm
LENGTHSOFFACTORS determines the factors common to aabbabb
and $y$, for positions $i = 0, 1, \ldots, 16$ on $y$. Values $\ell_i$ and $p_i$ are the
respective values of variables $\ell$ and $p$ of
the algorithm related to position $i$. At position 8 for example, $\ell_8 = 5$
indicates that the longest factor of aabbabb ending there has length 5; it
is bbabb; the current state is 7. An occurrence of the pattern is detected
when $\ell_i = 7 = |aabbabb|$, as it is at position 15.


The algorithm which computes the lengths $\ell_0, \ell_1, \ldots, \ell_{n-1}$ is given below. It
uses the table $L$, defined on states of the automaton (Section 2.4), to reset the
length of the current factor, after a traversal through a suffix link (line 8). The
correctness of this instruction is a consequence of Lemma 2.3.6. A simulation of
the computation is given in Figure 2.23.


LENGTHSOFFACTORS(A(x), y)
 1  (ℓ, p) ← (0, initial(A(x)))
 2  for i ← 0 to n − 1 do
 3      if TARGET(p, y[i]) is defined then
 4          (ℓ, p) ← (ℓ + 1, TARGET(p, y[i]))
 5      else do p ← f[p]
 6           while p is defined and TARGET(p, y[i]) is undefined
 7           if p is defined then
 8               (ℓ, p) ← (L[p] + 1, TARGET(p, y[i]))
 9           else (ℓ, p) ← (0, initial(A(x)))
10      output ℓ

THEOREM 2.8.1. The algorithm LENGTHSOFFACTORS applied to the automaton
$\mathcal{A}(x)$ and the word $y$ ($x, y \in A^*$) produces the lengths $\ell_0, \ell_1, \ldots, \ell_{|y|-1}$. It
makes fewer than $2|y|$ transitions in $\mathcal{A}(x)$ and runs in time $O(|y| \times \log \mathrm{Card}\,A)$
and space $O(|x|)$.


Proof. The proof of correctness of the algorithm is done by induction on the
length of prefixes of $y$. We show more exactly that the equalities

$$\ell = \ell_i$$

and

$$p = \delta(\mathrm{initial}(\mathcal{A}(x)), y[i - \ell + 1..i])$$

are invariants of the for loop, denoting by $\delta$ the transition function of $\mathcal{A}(x)$.
Let $i \ge 0$. The already-treated prefix has length $i$ and the current letter is
$y[i]$. It is supposed that the condition is satisfied for $i - 1$. Thus, $u = y[i - \ell..i - 1]$
is the longest factor of $x$ ending at position $i - 1$ and $p = \delta(\mathrm{initial}(\mathcal{A}(x)), u)$.

Let $w$ be the suffix of length $\ell_i$ of $y[0..i]$. Let us first suppose $w \ne \varepsilon$;
therefore $w$ can be written $v \cdot y[i]$ with $v \in A^*$. Note that $v$ cannot be longer than $u$
because this would contradict the definition of $u$, and thus $v$ is a suffix of $u$.

If $v = u$, $\delta(p, y[i])$ is defined and provides the next value of $p$. Moreover,
$\ell_i = \ell + 1$. These two points correspond to the update of $(\ell, p)$ carried out at
line 4, which shows that the condition is satisfied for $i$ in this situation.

When $v$ is a proper suffix of $u$, we consider the greatest integer $k$, $k > 0$,
for which $v$ is a suffix of $s_x^k(u)$ where $s_x$ is the suffix function relative to $x$
(Section 2.3). Lemma 2.3.6 implies that $v = s_x^k(u)$ and that the length of this
word is $L_x(q)$ where $q = \delta(\mathrm{initial}(\mathcal{A}(x)), v)$. The new value of $p$ is thus $\delta(q, y[i])$,
and that of $\ell$ is $L_x(q) + 1$. This is done by the instruction at line 8, since $f$
and $L$ respectively implement the suffix function and the length function of the
automaton, and according to Proposition 2.4.5 which establishes the relation
with function $s_x$.

When $w = \varepsilon$, this means that letter $y[i] \notin \mathrm{alph}(x)$. It is thus necessary to
re-initialize the pair $(\ell, p)$, which is done at line 9.

Finally, it is noted that the proof is also valid for the treatment of the first 
letter of y, which finishes the proof of the invariant condition and proves the 
correctness of the algorithm. 

For the complexity, one notices that each transition done, successfully or
not, leads to incrementing $i$ or to strictly increasing the value of $i - \ell$. As
each one of these two expressions varies from 0 to $|y|$, we deduce that the
number of transitions done by the algorithm is no more than $2|y|$. Moreover, as
the execution time of all the transitions is representative of the total execution
time, it is $O(|y| \times \log \mathrm{Card}\,A)$.

The memory space necessary to run the algorithm is used mainly to store
the automaton $\mathcal{A}(x)$ which has size $O(|x|)$ according to Theorem 2.4.4. This
gives the last stated result, and finishes the proof. □
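
The algorithm LENGTHSOFFACTORS translates almost line by line into Python. The sketch below is an illustration only: the helper `build_suffix_automaton` (a name chosen here) is the standard on-line construction of the suffix automaton, `link` stands for the suffix link $f$, `length` for the table $L$, and the value $-1$ encodes an undefined link.

```python
def build_suffix_automaton(x):
    """Standard on-line construction (hypothetical helper, not from the text)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in x:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length


def lengths_of_factors(x, y):
    """For each position i of y, length of the longest factor of x ending at i."""
    nxt, link, length = build_suffix_automaton(x)
    ell, p = 0, 0
    lengths = []
    for c in y:
        if c in nxt[p]:
            ell, p = ell + 1, nxt[p][c]            # line 4
        else:
            p = link[p]                             # lines 5-6: follow links
            while p != -1 and c not in nxt[p]:
                p = link[p]
            if p != -1:
                ell, p = length[p] + 1, nxt[p][c]   # line 8
            else:
                ell, p = 0, 0                       # line 9: restart
        lengths.append(ell)
    return lengths
```

An occurrence of $x$ ends at position $i$ exactly when `lengths[i] == len(x)`. For example, with $x$ = aabbabb and the text aaabbbabbaabbabbb (a text chosen for this illustration), the value at position 8 is 5 and the value at position 15 is 7.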


2.8.2. Optimization of suffix links 


Since the algorithm LENGTHSOFFACTORS works in a sequential way, it is natu- 
ral to consider its delay, that is, the maximum time spent on a letter of y. One 
realizes immediately that it is possible to modify the suffix function in order to 
reduce this time. 

Optimization is based on sets of letters that label edges going out of a state.
We define, for $p$ state of $\mathcal{A}(x)$,

$$\mathrm{Next}(p) = \{a \in A \mid \delta(p, a) \text{ is defined}\}.$$

Then, the new suffix link $\hat{f}$ is defined, for $p$ state of $\mathcal{A}(x)$, by the relation:

$$\hat{f}[p] = \begin{cases} f[p] & \text{if } \mathrm{Next}(p) \subsetneq \mathrm{Next}(f[p]), \\ \hat{f}[f[p]] & \text{else, if this value is defined.} \end{cases}$$

Note that the relation can leave the value of $\hat{f}[p]$ undefined. The idea of this
definition comes from the fact that the link is used as a failure function: there
is no need to go to $f[p]$ if $\mathrm{Next}(f[p]) \subseteq \mathrm{Next}(p)$.

Note that in the automaton $\mathcal{A}(x)$ one always has

$$\mathrm{Next}(p) \subseteq \mathrm{Next}(f[p]).$$

So, we can reformulate the definition of $\hat{f}$ as:

$$\hat{f}[p] = \begin{cases} f[p] & \text{if } \deg(p) \ne \deg(f[p]), \\ \hat{f}[f[p]] & \text{else, if this value is defined.} \end{cases}$$

The computation of $\hat{f}$ can thus be performed in linear time by considering
outgoing degrees (deg) of states in the automaton.

The optimization of the suffix link leads to a reduction of the delay of
algorithm LENGTHSOFFACTORS. The time can be evaluated as the number of
executions of the instruction at line 5. We get the following result, which shows
that the algorithm treats the letters of $y$ in real time when the alphabet is fixed.


PROPOSITION 2.8.2. When the algorithm LENGTHSOFFACTORS makes use of
the suffix link $\hat{f}$ in place of $f$, the treatment of each letter of $y$ takes a time
$O(\mathrm{Card}\,\mathrm{alph}(x))$.


Proof. The result is an immediate consequence of inclusions

$$\mathrm{Next}(p) \subsetneq \mathrm{Next}(\hat{f}[p]) \subseteq A$$

for each state $p$ for which $\hat{f}[p]$ is defined. □
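
The degree-based reformulation lends itself to a short computation, sketched below in Python. The sketch makes several assumptions of its own: the automaton comes from the standard on-line construction (helper `build_suffix_automaton`), states are processed by increasing length so that $\hat{f}[f[p]]$ is known before $\hat{f}[p]$, the value $-1$ encodes "undefined", and a comparison sort is used for brevity where a counting sort on lengths would keep the computation linear.

```python
def build_suffix_automaton(x):
    """Standard on-line construction (hypothetical helper, not from the text)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in x:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length


def optimized_links(nxt, link, length):
    """Compute the optimized suffix link from outgoing degrees."""
    fhat = [-1] * len(nxt)  # undefined on the initial state
    # The suffix link of p is strictly shorter than p, so increasing
    # length guarantees that fhat[link[p]] is already computed.
    for p in sorted(range(1, len(nxt)), key=lambda s: length[s]):
        f = link[p]
        if len(nxt[p]) != len(nxt[f]):
            fhat[p] = f          # Next(p) is strictly included in Next(f[p])
        else:
            fhat[p] = fhat[f]    # same letters: f[p] would fail as well
    return fhat
```

Using `fhat` in place of `link` in the search loop visits only states whose set of outgoing letters strictly grows, which is the property used in the proof above.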


2.8.3. Search for conjugates 


The sequence of lengths $\ell_0, \ell_1, \ldots, \ell_{n-1}$ of the preceding section carries rich
information on resemblances between the words $x$ and $y$. It can be exploited
in various ways by algorithms comparing words. It allows for example an
efficient computation of $\mathrm{LCF}(x, y)$, the maximum length of factors common to
$x$ and $y$. This is done in linear time on a bounded alphabet. This quantity
appears for example in the definition of the distance between words:

$$d(x, y) = |x| + |y| - 2\,\mathrm{LCF}(x, y).$$

We are interested in searching for conjugates (or rotations) of a word within
a text. The solution put forward in this section is another consequence of
the length computation described in the previous section. Let us recall that a
conjugate of word $x$ is a word of the form $v \cdot u$ for which $x = u \cdot v$.


Searching for conjugates Let $x \in A^*$. Locate all the occurrences of conjugates
of $x$ (of length $m$) occurring in a word $y$ (of length $n$).

A first solution consists in applying a classical algorithm for searching a finite
set of words after having built the trie of conjugates of $x$. The search time is
then proportional to $n$ (on a fixed alphabet), but the trie can have a quadratic
size $O(m^2)$, as can the (non-compact) suffix trie of $x$.

The solution based on the use of a suffix automaton does not have this
disadvantage while preserving an equivalent execution time. The technique is
derived from the computation of lengths done in the preceding section. For this
purpose, we consider the suffix automaton of the word $x \cdot x$, noting that every
conjugate of $x$ is a factor of $x \cdot x$. One could even consider the word $x \cdot wA^{-1}$
where $w$ is the primitive root of $x$, but that does not change the following result.


PROPOSITION 2.8.3. Let $x, y \in A^*$. Locating the conjugates of $x$ in $y$ can be
done in time $O(|y| \times \log \mathrm{Card}\,A)$ within a memory space $O(|x|)$.


Proof. We consider a variant of algorithm LENGTHSOFFACTORS that produces
the positions of the occurrences of factors having a length not smaller than a
given integer $k$. The transformation is immediate since at each stage of the
algorithm the length of the current factor is stored in variable $\ell$.

The modified algorithm is applied to the automaton $\mathcal{A}(x^2)$ and the word $y$
with parameter $k = |x|$. The algorithm thus determines factors of length $|x|$ of
$x^2$ which appear in $y$. The conclusion follows, noting that factors of length $|x|$
of $x^2$ are conjugates of $x$, and that all conjugates of $x$ appear in $x^2$. □
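
The variant used in this proof can be sketched in Python as follows, under the same assumptions as before: a standard on-line construction of the suffix automaton (helper `build_suffix_automaton`, a name chosen here), applied to $x \cdot x$, followed by the length computation with threshold $k = |x|$. The function name `conjugate_positions` and the guard on the empty word are choices made for this illustration.

```python
def build_suffix_automaton(x):
    """Standard on-line construction (hypothetical helper, not from the text)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in x:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length


def conjugate_positions(x, y):
    """Starting positions in y of the occurrences of conjugates of x."""
    m = len(x)
    if m == 0:
        return []  # convention chosen here for the empty word
    nxt, link, length = build_suffix_automaton(x + x)
    ell, p = 0, 0
    positions = []
    for i, c in enumerate(y):
        if c in nxt[p]:
            ell, p = ell + 1, nxt[p][c]
        else:
            p = link[p]
            while p != -1 and c not in nxt[p]:
                p = link[p]
            if p != -1:
                ell, p = length[p] + 1, nxt[p][c]
            else:
                ell, p = 0, 0
        # A factor of x.x of length at least m ends here, so the last m
        # letters read form a conjugate of x.
        if ell >= m:
            positions.append(i - m + 1)
    return positions
```

For example, `conjugate_positions("abc", "zbcabcaz")` reports positions 1, 2, 3 and 4, where bca, cab, abc and bca occur.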


The concept of index is widely used in questions related to data retrieval
techniques. One can refer to the book of Baeza-Yates and Ribeiro-Neto 1999 to
go deeper into the subject, or to that of Salton 1989. Apostolico 1985 describes
several algorithmic applications of suffix trees that often apply to other suffix
structures.

Personal searching systems, or indexes used by search engines, often use
simpler techniques like the constitution of lexicons of rare words or $k$-grams
(i.e., factors of length $k$) with $k$ relatively small.

Most of the topics covered in this chapter are classical in string algorithmics.
The book of Gusfield 1997 contains a good number of problems,
especially those grounded on questions in computational molecular biology, whose
algorithmic solutions rest on the use of data structures for indexes, including
questions related to repetitions.

Forbidden words of Section 2.7.2 are used in the DCA compression method 
of Crochemore, Mignosi, Restivo, and Salemi 2000. 

The use of suffix automata as searching machines is due to Crochemore 
1987. Using suffix trees for this purpose produces an immediate but less efficient 
solution (see exercise 2.2.4). 



Problems 

Section 2.2 

2.2.1 Check that the execution of SUFFIXTREE($a^n$) ($a \in A$) takes a time
$O(n)$. Check that the execution time of SUFFIXTREE($y$) is $\Omega(n \log n)$
when $\mathrm{Card}\,\mathrm{alph}(y) = |y| = n$.

2.2.2. How many nodes are there in the compact suffix tree of a de Bruijn 
word? How many for a Fibonacci word? Same question for their com- 
pact and non compact suffix automata. 

2.2.3 Let $T_{k,\ell}(y)$ be the compact trie that accepts the factors of word $y$ that
have a length ranging between the two natural integers $k$ and $\ell$ ($0 < k <
\ell < |y|$). Design an algorithm to build $T_{k,\ell}(y)$ that uses a memory
space proportional to the size of the tree (and not $O(|y|)$) and that runs
in the same asymptotic time as the construction of the suffix tree of $y$.

2.2.4 Design an algorithm for the computation of $\mathrm{LCF}(x, y)$ ($x, y \in A^*$),
maximum length of factors common to $x$ and $y$, based on the tree $G(x \cdot c \cdot y)$,
where $c \in A$ and $c \notin \mathrm{alph}(x \cdot y)$. What is the time and space complexity
of the computation? Compare with the solution in Section 2.8.

2.2.5 Give a bound on the number of cubes of primitive words occurring in a 
word of length n. Same question for squares. (Hint: use the suffix tree 
of the word.) 

2.2.6 Design an algorithm for the fusion of two suffix trees. 

2.2.7 Describe a linear time and space algorithm (on a fixed alphabet) for the 
construction of the suffix tree of a finite set of words. 

Section 2.3

2.3.1 Let $y$ be a word in which the last letter does not appear elsewhere. Show
that $\mathcal{F}(y)$, the minimal deterministic automaton accepting the factors
of $y$, has the same states and same edges as $\mathcal{A}(y)$ (only the terminal
states differ).

2.3.2 Give the precise number of states and edges in the factor automaton
$\mathcal{F}(y)$.

Section 2.4 

2.4.1 Design an on-line algorithm for the construction of the factor automaton
$\mathcal{F}(y)$. The algorithm should run in linear time and space on a finite and
fixed alphabet.

2.4.2 Design a linear-time algorithm (on a fixed alphabet) for the construction 


of the suffix automaton of a finite set of words. 
Section 2.5


2.5.1 Describe an algorithm for constructing $\mathcal{A}^c(y)$ from $G(y)$.

2.5.2 Describe an algorithm for constructing $\mathcal{A}^c(y)$ from $\mathcal{A}(y)$.

2.5.3 Write in detail the code of the algorithm for the direct construction of
$\mathcal{A}^c(y)$.

2.5.4 Design an on-line algorithm for constructing $\mathcal{A}^c(y)$.


Section 2.7 


2.7.1 Let $k > 0$ be an integer. Implement an algorithm, based on one of the
suffix automata of $y \in A^*$, which determines the factors that appear
at least $k$ times in $y$.

2.7.2 For $y \in A^*$, design an algorithm for computing the maximum length
of factors of $y$ which have two non-overlapping occurrences (i.e., if $u$ is
such a factor, it appears in $y$ at two positions $i$ and $j$ such that $i + |u| \le j$).

2.7.3 It is said that a language $M \subseteq A^*$ avoids a word $u \in A^*$ if $u$ is not
a factor of any word of $M$. Let $M$ be the language of words that avoid all
the words of a finite set $I \subseteq A^*$. Show that $M$ is accepted by a finite
automaton. Give an algorithm that builds an automaton accepting $M$
given the trie of $I$.

2.7.4 Design a construction of the automaton $\mathcal{F}(y)$ given the trie of forbidden
words $I(y)$.


Section 2.8 


2.8.1 Provide an infinite family of words for which each word has a trie of its 
conjugates that is of quadratic size. 

2.8.2 Design an algorithm for locating conjugates of $x$ in $y$ ($x, y \in A^*$), given
the tree $G(x \cdot x \cdot c \cdot y)$, where $c \in A$ and $c \notin \mathrm{alph}(x \cdot y)$. What is the
complexity of the computation?


Notes 


The concept of position tree is due to Weiner 1973 who presented an algorithm 
to compute its compact version. The algorithm of Section 2.2 is from McCreight 
1976. A strictly sequential version of the suffix tree construction was described 
by Ukkonen 1995. 

For questions referring to formal languages, like concepts of syntactic con- 
gruences and minimal automata, one can refer to the books of Berstel 1979 and 
Pin 1986. 

The suffix automaton of a text with unmarked terminal states is also known
as the suffix DAWG, Directed Acyclic Word Graph. Its linearity was discovered
by Blumer, Blumer, Ehrenfeucht, Haussler, and McConnell 1983 who gave a
linear construction of it on a fixed alphabet (see also Blumer et al. 1985). The
minimality of the structure as an automaton is due to Crochemore 1986, who
showed how to build within the same complexity the factor automaton of a text
(see exercises 2.3.1, 2.3.2 and 2.4.1).

The notion of compact suffix automaton appears in Blumer, Ehrenfeucht, 
and Haussler 1989. An algorithm for compacting suffix automata, as well as 
a direct construction of compact suffix automata, is presented in Crochemore 
and Vérin 1997. An on-line construction of compact suffix automata has been 
designed by Inenaga, Hoshino, Shinohara, Takeda, Arikawa, Mauri, and Pavesi 
2001. 

For the average analysis of sizes of the various structures presented in the 
chapter one can refer to Szpankowski 1993b and to Jacquet and Szpankowski 
1994, who corrected a previous analysis by Blumer et al. 1989, extended by Raf- 
finot 1997. These analyses rely on methods described in the book of Sedgewick 
and Flajolet 1995. 

On special integer alphabets, Farach 1997 has designed a linear time con- 
struction of suffix trees. 

Indexes can also be realized efficiently with the use of suffix arrays. This 
data structure may be viewed as an implementation of a suffix tree. The notion 
has been introduced by Manber and Myers 1993 who designed the first efficient 
algorithms for its construction and use. On special integer alphabets, a suffix 
array can be built in linear time by three independent algorithms provided by 
Kärkkäinen and Sanders 2003, Kim, Sim, Park, and Park 2003, and Ko and
Aluru 2003. 
3.0. Introduction


Fundamental notions of combinatorics on words underlie natural language pro- 
cessing. This is not surprising, since combinatorics on words can be seen as the 
formal study of sets of strings, and sets of strings are fundamental objects in 
language processing. 

Indeed, language processing is obviously a matter of strings. A text or a
discourse is a sequence¹ of sentences; a sentence is a sequence of words; a word
is a sequence of letters. The most universal levels are those of sentence, word
and letter (or phoneme), but intermediate levels exist, and can be crucial in
some languages, between word and letter: a level of morphological elements
(e.g. suffixes), and the level of syllables. The discovery of this piling up of
levels, and in particular of word level and phoneme level, delighted structuralist
linguists in the 20th century. They termed this inherent, universal feature of
human language “double articulation”.

¹ In this chapter, we will not use the term “word” to denote a sequence of symbols, in order
to avoid ambiguity with the linguistic meaning.

It is a little more intricate to see how sets of strings are involved. There are 
two main reasons. First, at a point in a linguistic flow of data being processed, 
you must be able to predict the set of possible continuations after what is 
already known, or at least to expect any continuation among some set of strings 
that depends on the language. Second, natural languages are ambiguous, i.e. a 
written or spoken portion of text can often be understood or analyzed in several 
ways, and the analyses are handled as a set of strings as long as they cannot 
be reduced to a single analysis. The notion of set of strings covers the two 
dimensions that linguists call the syntagmatic axis, i.e. that of the chronological 
sequence of elements in a given utterance, and the paradigmatic axis, i.e. the 
“or” relation between linguistic forms that can substitute for one another. 

The connection between language processing and combinatorics on words 
is natural. Historically, linguists actually played a part in the beginning of the 
construction of theoretical combinatorics on words. Some of the terms in current 
use originate from linguistics: word, prefix, suffix, grammar, syntactic monoid... 
However, interpenetration between the formal world of computer theory and 
the intuitive world of linguistics is still a love story with ups and downs. We 
will encounter in this chapter, for example, terms that specialists of language 
processing use without bothering about what they mean in mathematics or in 
linguistics. 

This chapter is organized around the main levels of any language modeling: 
first, how words are made from letters; second, how sentences are made from 
words. We will survey the basic operations of interest for language processing, 
and for each type of operation we will examine the formal notions and tools 
involved. 


3.1. From letters to words 


All the operations in the world between letters and words can be collectively 
denoted by the term “lexical analysis”. Such operations mainly involve finite 
automata and transducers. Specialists in language processing usually refer to 
these formal tools with the term “finite-state” tools, because they have a finite 
number of states. 


3.1.1. Normalization of encoding 


The computer encoding of the 26 letters of the Latin alphabet is fairly stan- 
dardized. However, almost all languages need additional characters for their 
writing. European languages use letters with diacritics: accents (é, è), cedilla
(ç), tilde (ñ), umlaut (ü)... There are a few ligatures, the use of some of them
being standard in some conditions: æ, œ, ß; others are optional variants: ff,
fl. The encoding of these extensions of 7-bit ASCII is by no means normalized:
computer manufacturers and software publishers have always tended to propose
divergent encodings in order to hold users captive and so faithful. Thus, é
is encoded as 82 and 8E in two common extended ASCII codes, as 00E9 in 
UCS-2 Unicode, as C3A9 in UTF-8 Unicode, and named “&eacute;” by ISO 
8879:1986 standard. The situation of other alphabets (Greek, Cyrillic, Korean, 
Japanese...) is similar. The encoding systems for the Korean national writing 
system are based on different levels: in KSC 5601-1992, each symbol represents 
a syllable; in “n-byte” encodings, each symbol represents a segment of a syllable, 
often a phoneme. 
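
The codes quoted above for é can be checked with a few lines of Python. This is only a sketch: the text does not name the two extended ASCII codes, so the codec names `cp437` and `mac_roman` are assumptions, chosen because they yield 82 and 8E; `utf-16-be` stands in for UCS-2, with which it coincides for this character.

```python
import html

E_ACUTE = "\u00e9"  # the letter é


def hex_code(char, codec):
    """Return the encoding of `char` in `codec` as uppercase hexadecimal."""
    return char.encode(codec).hex().upper()


# Codec names are assumptions: the text only speaks of
# "two common extended ASCII codes".
codes = {codec: hex_code(E_ACUTE, codec)
         for codec in ("cp437", "mac_roman", "utf-16-be", "utf-8")}
print(codes)

# The ISO 8879 entity name is resolved by the standard library.
print(html.unescape("&eacute;") == E_ACUTE)
```

The same character thus occupies one byte in the two extended ASCII codes and two bytes in both Unicode encodings, which is why transliteration may map one symbol to several.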

Thus, generally speaking, when an encoding is transliterated into another, 
a symbol may be mapped to a sequence of several symbols, or the reverse. 
Transliteration implies (i) cutting up input text into a concatenation of seg- 
ments, and (ii) translating each segment. Both aspects depend on input and 
output encodings. 

Transliteration is simple whenever it is unambiguous, i.e. when source en- 
coding and target encoding convey exactly the same information in two different 
forms. The underlying formal objects are very simple. The set of possible seg- 
ments in input text is a finite code (the input code). It is often even a prefix 
code, i.e. no segment is a prefix of another. Here is an example of an input code 
that is not prefix: consider transliterating a phoneme-based Korean encoding 
into a syllable-based encoding. A 5-symbol input sequence kilto must be seg- 
mented as kil/to in order to be translated into a 2-symbol output sequence, but 
kilo must be segmented as ki/lo. 

In any case, encodings are designed so that transliteration can be performed 
by a sequential transducer. 

For the reader’s convenience, we will recall a few of the definitions of
section 1.5. A finite transducer over the alphabets A, B is a finite automaton
in which all edges have an input label u ∈ A* and an output label v ∈ B*.
The input alphabet A can be different from the output alphabet B, but they
frequently have a nonempty common subset. The notation we will use is
convenient when a transducer is considered as an automaton over a finite alphabet
of the form X ⊂ A* × B*, as in section 3.1.5, and when we define a formal notion
of alignment, as in section 3.1.7. Elements of X will be denoted (u:v), or with
u stacked above v as in Fig. 3.1; edges will be denoted (p, u:v, q). The label of
a successful path of a transducer consists of a pair of sequences (w:x) ∈ A* × B*.
Corresponding input and output sequences may be of different lengths in number
of symbols, and some of the edges may have input and output labels of different
lengths. A transducer over A and B is input-wise literal if and only if all input
labels are in A ∪ {ε}, and input-wise synchronous if and only if they are in A.
The set of labels of successful paths of a transducer is the transduction realized
by the transducer. A transduction over A and B is a relation between A* and B*.
A transduction over A and B can be specified by a regular expression in the
monoid A* × B* if and only if it is realized by a finite transducer.

A sequential transducer is a finite transducer with additional output labels
attached to the initial and terminal states, and with the following properties:

• it has at most one initial state,

• it is input-wise synchronous,

• for each state p and input label a ∈ A, there is at most one edge
  (p, a:u, q) ∈ E.

The output string for a given input string is obtained by concatenating the 
initial output label, the output label of the path defined by the input string, 
and the terminal output label attached to the terminal state that ends the 
path. With a sequential transducer, input sequences can be mapped into output 
sequences through input-wise deterministic traversal. All transductions realized 
by sequential transducers are word functions. Sequential transducers can be 
minimized (cf. section 1.5.2). 

In practice, the output labels attached to terminal states are necessary for 
transliteration when input code is not prefix. The second and third properties 
above are obtained by adapting the alignment between input labels and output 
labels, i.e. by making them shorter or longer and by shifting parts of labels be- 
tween adjacent edges. Fig. 3.1 shows a sequential transducer that transliterates 
é and è from their ISO 8879 names, “&eacute;” and “&egrave;”, to their codes
in an extended ASCII encoding, 82 and 8A. 


Figure 3.1. A sequential transducer that substitutes “82” for “&eacute;” 
and “8A” for “&egrave;”. 
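
A transducer of this kind can be sketched in Python as follows. The class `SequentialTransducer` and the helper `from_prefix_code` are hypothetical names introduced for illustration; the helper assumes that the input code is a prefix code and emits each translation on the last symbol of its segment, which is only one possible alignment and not necessarily that of Fig. 3.1.

```python
class SequentialTransducer:
    """Input-wise synchronous, deterministic transducer with terminal outputs."""

    def __init__(self, edges, initial=0, initial_output="", terminal_outputs=None):
        self.edges = edges                    # (state, symbol) -> (output, next state)
        self.initial = initial
        self.initial_output = initial_output
        self.terminal_outputs = terminal_outputs or {}

    def transduce(self, text):
        """Return the output for `text`, or None if `text` is not accepted."""
        state, out = self.initial, [self.initial_output]
        for symbol in text:
            if (state, symbol) not in self.edges:
                return None
            output, state = self.edges[(state, symbol)]
            out.append(output)
        if state not in self.terminal_outputs:
            return None
        out.append(self.terminal_outputs[state])
        return "".join(out)


def from_prefix_code(code):
    """Build a transducer from a dict {segment: translation}.

    Assumption: no segment is a prefix of another (prefix code).
    States are the proper prefixes of the segments; state 0 is the empty prefix.
    """
    states = {"": 0}
    edges = {}
    for segment, translation in code.items():
        prefix = ""
        for i, symbol in enumerate(segment):
            last = i == len(segment) - 1
            target = "" if last else prefix + symbol
            states.setdefault(target, len(states))
            # Output is delayed until the whole segment has been read.
            edges[(states[prefix], symbol)] = (translation if last else "",
                                               states[target])
            prefix = target
    return SequentialTransducer(edges, terminal_outputs={0: ""})


code = {"&eacute;": "82", "&egrave;": "8A", "c": "c", "a": "a", "f": "f"}
transducer = from_prefix_code(code)
print(transducer.transduce("caf&eacute;"))
```

Since each pair (state, symbol) has at most one edge, the input is traversed deterministically; an input that stops in the middle of a segment ends in a non-terminal state and is rejected.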


The number of edges of transducers for normalization of character encoding 
is of the same order of magnitude as the sum of the lengths of the elements 
of the input code, say 30 if only letters are involved and 3000 if syllables are 
involved. 

Transliteration from one encoding to another is ambiguous when the target 
system is more informative than the source system. For example, 7-bit ASCII 
encoding, frequently used in informal communication, does not make any
difference between e and é, or between oe and the ligature œ. In a more elaborate
encoding, these forms are not equivalent: œ is not a free variant for oe; it can
be used in cœur but not in coexiste. Transliteration from 7-bit ASCII to an
extended ASCII encoding involves recognizing more complex linguistic elements,
like words. It cannot be performed by small sequential transducers.

The situation is even more complex in Korean and Japanese. In these lan- 
guages, text can be entirely written in national writing systems, but Chinese 
characters are traditionally substituted for part of it, according to specific rules. 
In Japan, the use of Chinese characters in written text is standard in formal 
communication; in Korea, this traditional substitution is not encouraged by the
authorities and is on the wane. Let us consider text with and without Chinese
characters as two encodings. The version with Chinese characters is usually
more informative than the one without: when a word element is ambiguous, it 
may have several transcriptions in Chinese characters, according to its respec- 
tive meanings. However, the reverse also happens. For instance, an ambiguous 
Chinese character that evokes “music”, “pleasure” or “love” in Korean words 
is pronounced differently, and transcribed ak, lak, nak or yo in the national 
writing system, depending on the words in which it occurs. 


3.1.2. Tokenization 


The first step in the processing of written text is helped by the fact that words 
are delimited by spaces. During Antiquity, this feature was exclusive to unvow- 
elled script of Semitic languages; it developed in Europe progressively during 
the early Middle Ages (Saenger, 1997) and is now shared by numerous languages 
in the world. 

Due to word delimitation, a simple computer program can segment written 
text into a sequence of words without recognizing them, e.g. without a dictio- 
nary. This process is called tokenization. Once it has been performed, words 
become directly available for further operations: statistics, full text indexation, 
dictionary lookup... 

The formal basis of delimiter-based tokenization is the unambiguous use of 
certain characters as delimiters. 

The alphabet of letters, A, and the alphabet of delimiters, D, are disjoint. A 
text is a sequence of letters and delimiters. After tokenization, it is a sequence 
of tokens. Word tokens are maximal occurrences of elements of A* in the text. 
Delimiter tokens can be defined either as single delimiters: 


Why/?/ /1/./ /Because/ /of/ /temperature/. 


or as sequences of delimiters: 


Why/? 1. /Because/ /of/ /temperature/. 


Some symbols, like dash (-) and apostrophe (’) in English, can be considered 
either as letters or as delimiters. In the first case, trade-off and seven-dollar 
are tokens; otherwise they are sequences of tokens. In any case, tokenization 
can be performed by simulating the two-state automaton of Fig. 3.2, and by 
registering a new word token whenever control shifts from state 1 to state 0.

Figure 3.2. An automaton for written text tokenization.
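
The simulation of such a two-state automaton can be sketched in Python. The function name `tokenize` and the default test `str.isalpha` for membership in the alphabet of letters are assumptions made for illustration; delimiter tokens are taken here to be single delimiters.

```python
def tokenize(text, is_letter=str.isalpha):
    """Split `text` into word tokens and single-delimiter tokens.

    State 0: outside a word; state 1: inside a word.
    A word token is registered when control shifts from state 1 to state 0.
    """
    tokens = []
    state, start = 0, 0
    for i, char in enumerate(text):
        if is_letter(char):
            if state == 0:
                state, start = 1, i      # a word begins here
        else:
            if state == 1:
                tokens.append(text[start:i])
                state = 0
            tokens.append(char)          # each delimiter is a token of its own
    if state == 1:                       # the text ends inside a word
        tokens.append(text[start:])
    return tokens


print(tokenize("Why? Because of temperature."))
# Treating the hyphen as a letter keeps "trade-off" in one token.
print(tokenize("trade-off", is_letter=lambda c: c.isalpha() or c == "-"))
```

Whether a symbol such as the hyphen counts as a letter is decided entirely by the `is_letter` predicate, so the automaton itself does not change.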


In this section, we used the term “word” in its everyday sense; I would even 
say in its visual sense: a word in written text is something visibly separated by 
spaces. However, this naive notion of word does not always give the best results 
if we base further processing on it, because visual words do not always behave 
as units conveying a meaning. For example white does in white trousers, but 
not in white wine. We will return to this matter in section 3.1.4. 

Delimiter-based tokenization is not applicable to languages written without 
delimitation between words, like Arabic, Chinese or Japanese. In these lan- 
guages, written text cannot be segmented into words without recognizing the 
words. The problem is exactly the same with spoken text: words are not audibly 
delimited. 

However, in some cases, another type of tokenization consists in identifying 
all the positions in the text where words are liable to begin. These positions cut 
up text into tokens. After that, words can be recognized as certain sequences 
of tokens. For instance, in Thai, words can only begin and end at
syllable boundaries, and syllable boundaries can only be preceded or followed by
certain patterns of phonemes. These patterns can be recognized by a transducer.


3.1.3. Zipf’s law 


During the tokenization of a text or of a collection of texts, it is easy to build 
the list of all the different tokens in the text, to count the occurrences of each 
different token, and to rank them by decreasing number of occurrences. What 
is the relation between rank $r$ and number of occurrences $n_r$? Zipf observed
that the following law is approximately true:

$$n_r = \frac{n_1}{r^a} \qquad (3.1.1)$$

with $a = 1$. As a matter of fact, there are few frequent tokens, and many
infrequent tokens. In experiments on French text, 1 token out of 2 was found
to belong to the most frequent 139 tokens. In fact, for $20 < r < 2000$, $n_r$ is a
little higher than predicted by (3.1.1).
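
The ranking described above, and the comparison with the prediction of the law, can be sketched as follows. The function names `rank_frequency` and `zipf_prediction` are hypothetical, and the token list is a toy example, far too small to exhibit the law.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, token, count) triples by decreasing number of occurrences."""
    counts = Counter(tokens)
    # Ties are broken alphabetically so that the ranking is reproducible.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, token, n) for rank, (token, n) in enumerate(ranked, start=1)]


def zipf_prediction(n1, rank, a=1.0):
    """Number of occurrences predicted at `rank` from the count n1 of rank 1."""
    return n1 / rank ** a


table = rank_frequency("a a a a b b c".split())
n1 = table[0][2]
for rank, token, n in table:
    print(rank, token, n, zipf_prediction(n1, rank))
```

On a real collection of texts, the same two functions give the observed counts and the predicted ones side by side, so that deviations in the middle ranks can be inspected.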

Several equations can be derived from Zipf’s law. The number $r_n$ of different
tokens that occur at least $n$ times is such that $n = n_1 / r_n^a$, so:

$$r_n = \left(\frac{n_1}{n}\right)^{1/a}$$

The number of different tokens that occur between $n$ and $n+1$ times is:

$$r_n - r_{n+1} = n_1^{1/a}\left(\frac{1}{n^{1/a}} - \frac{1}{(n+1)^{1/a}}\right) \qquad (3.1.2)$$

For large values of $n$ and $a = 1$, this is approximately $n_1/n^2$, which is confirmed
experimentally.

According to (3.1.2), the number of tokens that occur once (hapaxes) is
proportional to $n_1^{1/a}$. It is easy to observe that the number of occurrences of
a very frequent token is approximately proportional to the size of the text, i.e.
$n_1/N$ depends on the language but not on the text. This means that all texts
comprise roughly the same proportion of hapaxes.

Can Zipf’s law be used to predict the relation between the size of a text and 
the size of its vocabulary? The size of the text is the total number of occurrences 
of tokens, 

$$N = n_1 + n_2 + \dots + n_R$$

where $R$ is the size of the vocabulary, i.e. the number of different tokens. With
$a = 1$, we have:

$$N = n_1 \sum_{r=1}^{R} \frac{1}{r} \approx n_1 \ln R$$

However, the relation between $N$ and $n_1$ in this equation is not confirmed
experimentally. Firstly, $n_1$ is proportional to $N$. Secondly, the growth of $R$
with respect to N tends to slow down, because of the tokens that occur again, 
whereas this equation implies that it would speed up. Thirdly, if this law were 
accurate, R would grow unbounded with N, which means that the vocabulary 
of a language would be infinite. What is surprising and counter-intuitive is that 
a steady growth of R with respect to N is maintained for texts up to several 
million different tokens. 

In other words, Zipf’s law correctly predicts that a collection of texts needs to 
be very large and diverse to encompass the complete vocabulary of a language, 
because new texts will contain new words for a very long time. Experience 
shows, for example, that the proportion of vocabulary which is shared by one 
year’s production of a newspaper and another year’s production is smaller than 
simple intuition would suggest. 


3.1.4. Dictionary compression and lookup 


Most operations on text require information about words: their translation into 
another language, for example. Since such information cannot in general be 
computed from the form of words, it is stored in large databases, in association 
with the words. Information about words must be formal, precise, systematic 
and explicit, so that it can be exploited for language processing. Such informa- 
tion is encoded into word tags or lexical tags. Examples of word tags are given 
in Fig. 3.3. The tags in this figure record only essential information:

fit         fit         A
fit         fit         N:s
fit         fit         V:W:P1s:P2s:P1p:P2p:P3p
fitter      fit         A:C
fitting     fit         V:G
hop         hop         N:s
hop         hop         V:W:P1s:P2s:P1p:P2p:P3p
hope        hope        N:s
hope        hope        V:W:P1s:P2s:P1p:P2p:P3p
hoping      hope        V:G
hopping     hop         V:G
hot         hot         A
hot air     hot air     N:s
hotter      hot         A:C
open        open        A
open        open        N:s
open        open        V:W:P1s:P2s:P1p:P2p:P3p
open air    open air    N:s

Figure 3.3. The word tags for a few English words. 


• the lemma, which is the corresponding form with default inflectional
  features, e.g. the infinitive, in the case of verbs,

• the part of speech: A, N, V...,

• the inflectional features.

Lemmas are necessary for nearly all applications, because they are indexes to 
properties of words. If all the vocabulary is taken into account, the tag set used 
in Fig. 3.3 has many thousands of elements, due to lemmas. Size of tag sets is 
a measure of the informative content of tags. 

The operation of assigning tags to words in a text is called lexical tagging. 
It is one of the main objectives of lexical analysis. The reverse operation is 
useful in text generation: words are first generated in the form of lexical tags, 
then you have to spell them. In many languages, it is feasible to construct a 
list of roughly all words that can occur in texts. Such a list, with unambiguous 
word tags, is called an electronic dictionary², or a dictionary. The strange term
“full-form dictionary” is also in use. An electronic dictionary contains on the
order of a million words. Such a list is always an approximation, due to the fact that
new words continuously come into use: proper nouns, foreign borrowings, new 
derivatives of existing words... 


²The term “electronic dictionary” emphasizes the fact that entries are designed for
programs, whereas the content of “conventional dictionaries” is meant for human readers, no
matter whether they are stored on paper or on electronic support.

In inflectional languages like English, the construction of an electronic
dictionary involves generating inflected forms, like conjugated verbs or plurals. This
operation is usually carried out with tables of suffixes, prefixes or infixes, or 
with equivalent devices. 

What is considered as a word is not always clear, because words sometimes 
appear as combinations of words, e.g. hot air “meaningless talk”, open air “out- 
doors space”, white wine, which are called compound words. The situation is 
less clear with numerals, e.g. sixty-nine: linguistically, each of them is equivalent 
to a determiner, which is a word; technically, if we include them in the dictio- 
nary, they are another million words; syntactically, they are made of elements 
combined according to rules, but these rules are entirely specific to numerals and 
are not found anywhere in the syntax of the language. The status of such forms 
and of other examples like dates is not easy to assign. If they are considered as 
words, then the simplest form of description for them is a finite automaton. We 
will refer to such automata in section 3.2.2 by the term “local grammars”. 

The most repetitive operation on an electronic dictionary is lookup. The 
input of this operation is word forms, and the output, word tags. Natural and 
efficient data structures for them are tries, with output associated to leaves, and 
transducers. In both cases, lookup is done in linear time with respect to the 
length of the word, and does not depend on the size of the dictionary. 
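
As an illustration, a trie with outputs can be sketched with nested Python dictionaries. The names `build_trie` and `lookup`, the use of the key `None` to hold the outputs of a node, and the "lemma.tag" layout of the output strings are assumptions made for this sketch; outputs are attached to every node that ends a word form, whether or not it is a leaf.

```python
def build_trie(entries):
    """Build a trie from (word form, word tag) pairs."""
    root = {}
    for form, tag in entries:
        node = root
        for char in form:
            node = node.setdefault(char, {})
        # The key None cannot clash with a character; it holds the outputs.
        node.setdefault(None, []).append(tag)
    return root


def lookup(trie, form):
    """Return the list of tags of `form`, following one edge per letter."""
    node = trie
    for char in form:
        node = node.get(char)
        if node is None:
            return []
    return node.get(None, [])


entries = [
    ("fit", "fit.A"),
    ("fit", "fit.N:s"),
    ("fit", "fit.V:W:P1s:P2s:P1p:P2p:P3p"),
    ("fitter", "fit.A:C"),
    ("hot air", "hot air.N:s"),
]
trie = build_trie(entries)
print(lookup(trie, "fit"))
print(lookup(trie, "fitter"))
```

The loop in `lookup` performs one dictionary access per letter of the word form, whatever the number of entries.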

Consider representing the dictionary in the form of a transducer. The dic- 
tionary is viewed as a finite set of word form/word tag pairs, i.e. a transduction. 
Alignment between input and output is based on the similarity between word 
forms and the lemmas included in word tags. This transduction is not a word 
function, since many word forms in a dictionary are associated with several word 
tags, like fit in Fig. 3.3: 


The shoes are fit for travel 
Max had a fit of fever
These shoes fit me 


Due to this universal phenomenon, known as lexical ambiguity or homogra- 
phy, the transduction cannot be represented by a sequential transducer. A p- 
sequential transducer is a generalization of sequential transducers with at most 
p terminal output strings at each terminal state. A p-sequential transducer for 
the words in Fig. 3.3 is shown in Fig. 3.4. In this transducer, the symbol # 
stands for a space character. The notion of p-sequential transducer allows for 
representing a transduction that is not a word function without resorting to an 
ambiguous transducer. A transducer is ambiguous if and only if it has distinct 
paths with the same input label. In a p-sequential transducer, there are no 
distinct paths with the same input label; any difference between output labels 
of the same path must occur in terminal output strings. 

In order to make the transducer p-sequential, lexically ambiguous word forms 
must be processed in a specific way: any difference between the several word 
tags for such a word form must be postponed to terminal output strings, by 
shifting parts of labels to adjacent edges. This operation may change the natural
alignment between input and output, and increase the number of states and
edges of the transducer, but the increase in size remains within reasonable
proportions because inflectional suffixes are usually short. After this operation, a
variant of algorithm TOSEQUENTIALTRANSDUCER (section 1.5) can be applied.

Figure 3.4. A p-sequential transducer for the words and tags in Fig. 3.3.

A dictionary represented as a transducer can be used to produce a dictionary 
for generation, by swapping input and output. The resulting transducer can be 
processed so that it becomes p-sequential too, provided that the dictionary is
finite. 

Fig. 3.5 shows an approximation of the preceding transducer by an acyclic 
automaton or DAWG. Most of the letters in the word form are identical to 
letters in the lemma and are not explicitly repeated in the output. The end
of the output is shifted to the right and attached to terminal states, with an
integer indicating how many letters at the end of the word form are not part 
of the lemma. When several output strings are possible for the same word, 
they are concatenated and the result is attached to a terminal state. During 
minimization of the DAWG, terminal states can be merged only if the output 
strings attached to them are identical. For the tag set used in Fig. 3.3, and 
for all the vocabulary, there are only about 2000 different output strings. The 
practical advantage of this solution is that output strings are stored in a table 
that need not be compressed and is easy to search for word tags. 
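
The principle of these output strings can be sketched in Python. The exact layout is an assumption: here an output string is the number of final letters of the word form to drop, followed by the letters of the lemma to append, a dot and the grammatical tag (for example "3e.V:G" for hoping). The function names `compress` and `expand` are hypothetical, and lemmas are assumed to contain neither digits nor dots.

```python
import re


def compress(form, lemma, tag):
    """Encode the lemma relative to the word form."""
    k = 0
    while k < min(len(form), len(lemma)) and form[k] == lemma[k]:
        k += 1                               # length of the common prefix
    return f"{len(form) - k}{lemma[k:]}.{tag}"


def expand(form, output):
    """Recover (lemma, tag) from a word form and its output string."""
    head, tag = output.split(".", 1)
    match = re.fullmatch(r"(\d+)(.*)", head)
    cut, ending = int(match.group(1)), match.group(2)
    return form[:len(form) - cut] + ending, tag


entries = [
    ("fit", "fit", "N:s"), ("hop", "hop", "N:s"), ("hope", "hope", "N:s"),
    ("fitter", "fit", "A:C"), ("hotter", "hot", "A:C"),
    ("hoping", "hope", "V:G"), ("hopping", "hop", "V:G"),
]
outputs = {compress(form, lemma, tag) for form, lemma, tag in entries}
print(sorted(outputs))
```

Because the lemma is expressed relative to the word form, entries with different lemmas share the same output string, which is what keeps the number of distinct output strings small.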




Figure 3.5. The DAWG for the words and tags in Fig. 3.3. 


In the previous figures, we have presented the same dictionary in different 
forms. The form containing most redundancy is the list (Fig. 3.3): parts of 
words are repeated, not only in lemmas and inflected forms, but also across 
different entries. The DAWG (Fig. 3.5) is virtually free of this redundancy, but 
it is unreadable and cannot be updated directly. In fact, linguistic maintenance 
must be carried out on yet another form, the dictionary of lemmas used to 
generate the list of Fig. 3.3. The dictionary of lemmas is readable and presents 
little redundancy, two fundamental features for linguistic maintenance. But the
only way to exploit it computationally is to generate the list, a form with huge
redundancy, and then the DAWG. The flexibility of finite automata is essential
to this practical organization. 

The main difficulties with dictionary-based lexical tagging are lexical lacu- 
nae, errors and ambiguity. 

Lexical lacunae, i.e. words not found in a dictionary, are practically impossi- 
ble to avoid due to the continuous creation and borrowing of new words. Simple 
stopgaps are applicable by taking into account the form of words: for example, 
in English, a capitalized token not found in the dictionary is often a proper 
noun. 

Lexical errors are errors producing forms which do not belong to the vocabulary
of the language, e.g. coronre for coroner³. Lexical errors are impossible to
distinguish from lexical lacunae. A few frequent errors can be inserted in dictio- 
naries, but text writers are so creative that this solution cannot be implemented 
systematically. In order to deal with errors (find suggestions for corrections, re- 
trieve lexical information about correct forms), an electronic dictionary can be 
used. By looking up in an error-tolerant way, we find correct forms that are 
close to the erroneous form. 
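
The text does not specify how error-tolerant lookup is carried out. A minimal sketch, assuming that closeness is measured by an edit distance with insertions, deletions, substitutions and transpositions of adjacent letters, and that the dictionary is simply scanned entry by entry, could be the following; the names `edit_distance` and `suggestions` are hypothetical.

```python
def edit_distance(s, t):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s)][len(t)]


def suggestions(form, vocabulary, max_distance=1):
    """Return the correct forms within `max_distance` of an erroneous form."""
    return sorted(w for w in vocabulary if edit_distance(form, w) <= max_distance)


print(suggestions("coronre", {"coroner", "corner", "crown"}))
```

A linear scan is only practical for small vocabularies; with a dictionary stored as an automaton, the same distance would rather be computed while traversing the automaton, so that hopeless branches are abandoned early.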

Lexical ambiguity refers to the fact that many words should be assigned 
distinct tags in relation to context, like fit. About half the forms in a text are 
lexically ambiguous. Lexical ambiguity resolution is dealt with in section 3.2.4. 

In some languages, sequences of words are written without delimiter in cer- 
tain conditions, even if the sequence is not frozen. In German, ausschwimmen 
“to swim out” is the concatenation of aus “out” and schwimmen “swim”. Ob- 
viously, dictionary lookup has to take a special form in cases where a token 
comprises several words. 
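
One simple form that such a lookup can take is an exhaustive search for all ways of cutting the token into dictionary words. The sketch below is an assumption, not a description of an existing system: the function name `segmentations` is hypothetical and the vocabulary is reduced to a plain set of word forms.

```python
def segmentations(token, vocabulary):
    """Return every way of writing `token` as a concatenation of known words."""
    if not token:
        return [[]]                      # one way to segment the empty string
    results = []
    for i in range(1, len(token) + 1):
        head = token[:i]
        if head in vocabulary:
            for rest in segmentations(token[i:], vocabulary):
                results.append([head] + rest)
    return results


print(segmentations("ausschwimmen", {"aus", "schwimmen"}))
```

Several segmentations may be returned for the same token, which adds to the lexical ambiguity discussed above.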

Performing the lexical analysis of a text with a set of dictionaries requires 
adapted software, like the open-source system Unitex. Fig. 3.6 shows the result 
of the lexical analysis of an English text by Unitex. This system can also be 
used for the management of the dictionaries in their different forms, and for the 
operations on words that we will present in section 3.2. 


3.1.5. Morphological analysis 


Given a word in a written text, represented by a sequence of letters, how do you 
analyse it into a sequence of underlying morphological elements? This prob- 
lem is conveniently solved by the dictionary methods of the preceding section, 
except when the number of morphological elements that make up words is too 
large. This happens with agglutinative languages. English and other Indo- 
European languages are categorized as inflected languages. A few agglutinative 
languages are spoken in Europe: Turkish, Hungarian, Finnish, Basque... and 
many others are spoken on all other continents. In such languages, a word is a
concatenation of morphological elements, usually written without delimiters⁴.

³Errors can also produce words which belong to the vocabulary, like corner.

⁴When morphological elements are delimited by spaces, like in Sepedi, an African
agglutinative language, the problem of recognizing their combinations is quite different.

[Screenshot of Unitex 1.2 beta applied to the corpus ivanhoe.snt (2343 sentence
delimiters, 186614 tokens of which 9301 are different), showing the text and the
simple and compound words found in it with their tags.]

Figure 3.6. Lexical analysis of an English text by Unitex.

For example, the following Korean sequence, transliterated into the Latin alphabet:

manasiŏs’takojocha “even that (he) met”, comprises 6 elements:


• mana “meet”

• si (honorification of grammatical subject)

• ŏs’ (past)

• ta (declarative)

• ko “that”

• jocha “even”

and can be used in a sentence meaning “(The Professor) even (thought) that (he) 
met (her yesterday)”. The form of each element can depend on its neighbors, so 
each element has a canonical form or lemma and morphological variants. There 
are two types of morphological elements: stems, which are lexical entries, like 
“meet” in the Korean example, and grammatical affixes, like tense, mood or case 
markers. Morphological analysis consists of segmenting the word and finding 
the lemma and grammatical tag of each underlying morphological element. The 
converse problem, morphological generation, is relevant to machine translation 
in case of an agglutinative target language: words are constructed as sequences 
of morphological elements, but you have to apply rules to spell the resulting 
word correctly. 

Finite transducers are usually convenient for representing the linguistic data 
required for carrying out morphological analysis and generation. For example, 
Fig. 3.7 represents a part of English morphology as if it were agglutinative. 
This transducer analyses removably as the combination of three morphological
elements, remove.V, able.A and ly.ADV, and inserts plus signs in order to
delimit them. The transducer roughly respects a natural alignment between 
written forms and underlying analyses. It specifies two types of information: 
how written forms differ from underlying forms, and which combinations of 
morphological elements are possible. Grammatical codes are assigned to mor- 
phological elements: verb, adjective, tense/mood suffix, adverb. Some other 
examples of words analyzed by this transducer are remove, removable, removed, 
removing, accept, acceptable, acceptably, accepted, accepting, emphatic, emphati- 
cally, famous and famously. The four initial states should be connected to parts 
of the dictionary representing the stems that accept the suffixes represented in 
the transducer. 

In this toy example, it would have been simpler to make a list of all suffixed 
forms with their tags. However, combinations of morphological elements are 
more numerous and more regular in agglutinative languages than in English, 
and they justify the use of a transducer. 

Transducers of this kind obviously have to be manually constructed by lin- 
guists, which implies the use of a convenient, readable graphic form, so that 
errors are easily detected and maintenance is possible. A widely used set of 
conventions consists in attaching labels to states and not to edges. States are 
not explicitly numbered. This graphic form is sometimes called a “graph”. For 
example, Fig. 3.8 shows the same transducer as Fig. 3.7 but with this presen- 
tation. The expressive power is the same. When the transducer is used in an 
operation on text or with another transducer, it is compiled into the more tra- 
ditional form. During this compilation, states are assigned arbitrary numbers. 

The main challenge with algorithmic tools for morphological processing is 
the need to observe two constraints: manually constructed data must be pre- 
sented in a readable form, whereas data directly used to process text must be 
coded in adapted data structures. When no format is simultaneously readable 
and adapted to efficient processing, the data in the readable form must be
automatically compiled into the operation-oriented form. This organization should
not be given up as soon as operation-oriented data are available: linguistic
maintenance, i.e. correction of errors, inclusion of new words, selection of
vocabulary for applications etc., can only be done in the readable form.

Figure 3.7. Morphological analysis in English.

Transducers for morphological analysis are usually ambiguous. This happens 
when a written word has several morphological analyses, like flatter, analyzable 
as flatter.V in Advertisements flatter consumers; and as flat.A+er.C in The 
ground is flatter here. The fact that transducers are ambiguous is not a prob- 
lem for linguistic description, since ambiguous transducers are as readable as 
unambiguous ones. However, it can raise algorithmic problems: in general, 
an ambiguous transducer cannot be traversed in an input-wise deterministic 
way. In inflected languages, this problem is avoided by substituting p-sequential
transducers for ambiguous transducers, but this solution is no longer valid for
most agglutinative languages. When ambiguity affects the first element in a long
sequence of morphological elements, shifting output labels to terminal output
strings would change the natural alignment between input and output to such
an extent that the number of states and edges of the transducer would explode.

Figure 3.8. Morphological analysis in English.

Therefore, algorithm TOSEQUENTIALTRANSDUCER is not applicable: am- 
biguous transducers have to be actually used. There are several ways of auto- 
matically reducing the degree of input-wise nondeterminism of an ambiguous 
transducer. We will see two methods which can be applied after the alignment 
of the transducer has been tuned so as to be input-wise synchronous (see sec- 
tion 3.1.1). Both methods will be exemplified on the transducer of Fig. 3.8, 
which has 4 initial states. These distinct initial states encode dependencies 
between stems and suffixes, as we will see on the last page of this section. For
simplicity’s sake, the stems are not included in this figure: thus, we will consider 
it as a collection of 4 transducers, and artificially maintain the 4 initial states. 

The first method consists in determinizing (algorithm NFATODFA, section 1.3.3)
and minimizing (section 1.3.4) the ambiguous transducer, considering
it as an automaton over a finite alphabet $X \subset A^* \times B^*$. In general, the
resulting transducer is still ambiguous: distinct edges can have the same origin, 
the same input label, and distinct ends, (p,a:u,q) and (p,a:v,r), but only if 
their output labels u and v are distinct. The transducer of Fig. 3.9 is the result 
of the application of this method to the transducer of Fig. 3.8. Applying the 
resulting transducer to a word involves a variant of the nondeterministic search 
of section 1.3.2 (algorithm ISACCEPTED), but the search is quicker than with 
the original transducer, because algorithm NFATODFA reduces the nondeterminism
of the transducer.

Figure 3.9. An ambiguous transducer determinized as an automaton.
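As a rough illustration of the first method, the following sketch applies the subset construction to a transducer whose edges are read as letters of X, i.e. as (input, output) pairs. The edge format, the name determinize_over_pairs and the toy data are assumptions made for this example; it is not algorithm NFATODFA as given in section 1.3.3, and minimization is left out.

```python
from collections import deque

def determinize_over_pairs(edges, initials, terminals):
    """Subset construction where each (input, output) pair is one letter.

    edges: iterable of (origin, input_label, output_label, end).
    Returns (dfa_edges, start, dfa_terminals); DFA states are frozensets.
    """
    terminals = set(terminals)
    step = {}
    for p, a, u, q in edges:
        step.setdefault((p, (a, u)), set()).add(q)

    start = frozenset(initials)
    dfa_edges, dfa_terminals = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        subset = queue.popleft()
        if subset & terminals:
            dfa_terminals.add(subset)
        # Group the states reachable from this subset by pair label.
        targets = {}
        for (p, label), ends in step.items():
            if p in subset:
                targets.setdefault(label, set()).update(ends)
        for label, ends in targets.items():
            target = frozenset(ends)
            dfa_edges[(subset, label)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_edges, start, dfa_terminals

# Toy data (hypothetical): two edges share origin, input and output,
# a third one has the same origin and input but a different output.
edges = [
    (0, "e", "#", 1),
    (0, "e", "#", 2),
    (0, "e", "e", 3),
    (1, "d", "ed.TM", 4),
    (2, "r", "er.C", 4),
]
dfa_edges, start, finals = determinize_over_pairs(edges, {0}, {4})
```

In the result, the two edges labelled e:# are merged into a single edge, while the edge labelled e:e remains a distinct edge with the same origin and the same input label: this is the residual ambiguity described above.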


In order to introduce the second method, we define a new generalization of 
p-sequential transducers. We will allow differences between output labels of the 
same path to occur at any place as long as they remain strictly local. Formally, 
a generalized sequential transducer is a finite transducer with a finite set of 
output labels I(i) for the initial state i, a finite set of output labels T(q) for
each terminal state q, and with the following properties: 

• it has at most one initial state,

• it is input-wise synchronous,

• for each pair of edges (p,a:u,q), (p,a:v,r) with the same origin and the
  same input label, q = r.

A transduction is realized by a generalized sequential transducer if and only 
if it is the composition of a sequential transduction with a finite substitution. 
Thus, such a transduction is not necessarily a word function: two edges can have 
the same origin, the same input label, the same end and distinct output labels, 
(p,a:u,q) and (p,a:v,q). However, given the input label of a path, a generalized 
sequential transducer can be traversed in an input-wise deterministic way, even 
if it is ambiguous. 
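The input-wise deterministic traversal can be sketched as follows. The representation is an assumption made for this example: since all edges with the same origin and input label share their end, they are stored as one entry mapping (state, symbol) to the common end and to the set of alternative output labels; apply_gst is a hypothetical name.

```python
def apply_gst(edges, initial, initial_outputs, terminal_outputs, word):
    """Return the set of outputs of a generalized sequential transducer.

    edges: dict mapping (state, input_symbol) -> (next_state, set of outputs).
    initial_outputs: the set I(i); terminal_outputs: dict state -> set T(q).
    """
    state = initial
    results = set(initial_outputs)
    for symbol in word:
        if (state, symbol) not in edges:
            return set()          # no path: the word is not in the domain
        state, outputs = edges[(state, symbol)]
        # A single next state, several local output variants.
        results = {r + u for r in results for u in outputs}
    if state not in terminal_outputs:
        return set()
    return {r + t for r in results for t in terminal_outputs[state]}

# Toy data (hypothetical): one ambiguous edge, traversed deterministically.
edges = {
    (0, "a"): (1, {"x", "y"}),
    (1, "b"): (2, {"z"}),
}
outputs = apply_gst(edges, 0, {""}, {2: {""}}, "ab")
```

Only one state is tracked at each step; the ambiguity is carried entirely by the sets of output strings.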

The second method constructs a generalized sequential transducer equivalent 
to the ambiguous transducer. When two edges with the same origin and the 
same input label have different output labels and different ends, output labels 
are shifted to adjacent edges to the right, but not necessarily until a terminal 
state is reached. The condition for ceasing shifting a set of output strings to 
the right is the following. Consider the set $E_{p,a}$ of all edges with origin p and
input label a. Each edge $e \in E_{p,a}$ has an output label $u_e \in B^*$ and an end
$q_e \in Q$. Consider the finite language $L_{p,a} \subset B^*Q$ over the alphabet $B \cup Q$
defined by $L_{p,a} = \{u_e q_e \mid e \in E_{p,a}\}$. If we can write $L_{p,a} = MN$ with $M \subset B^*$
and $N \subset B^*Q$, then

• create a new state r; let r be terminal if and only if at least one of the
  states $q_e$ is terminal;

• substitute a new set of edges for $E_{p,a}$: the edges (p,a:u,r) for all $u \in M$;

• shift the rest of output labels further to the right by replacing each edge
  $(q_e, b:w, s)$ with the edges $(r, b:xw, s)$ for all $x \in N$; for each terminal
  state among the states $q_e$, substitute $NT(q_e)$ for $T(q_e)$.

There can be several ways of writing $L_{p,a} = MN$: in such a case, the longer the
elements of M, the better.

If the transduction realized by the ambiguous transducer is finite, this algorithm
terminates; otherwise it is not certain to terminate. If it does, we obtain
an equivalent generalized sequential transducer like that of Fig. 3.10.
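The factorization step can be illustrated by a small heuristic. This is only a sketch under our own assumptions: each word of the language is stored as a pair (output string, end state), the function name factorize is hypothetical, and the search tries the prefixes of one longest output label, longest first, which is one possible reading of "the longer the elements of M, the better".

```python
def factorize(L):
    """Write L = M N with M a set of strings and N a set of (string, state).

    L: nonempty set of (output_string, end_state) pairs.
    The trivial factorization M = {""}, N = L is always found at cut 0.
    """
    ref_out, _ = max(L, key=lambda pair: len(pair[0]))
    prefixes = {u[:k] for u, _ in L for k in range(len(u) + 1)}
    for cut in range(len(ref_out), -1, -1):      # longest prefix first
        m0 = ref_out[:cut]
        # Left quotient: everything that may follow m0 inside L.
        N = {(u[len(m0):], q) for u, q in L if u.startswith(m0)}
        # All prefixes m such that m N stays inside L.
        M = {m for m in prefixes if all((m + n, q) in L for n, q in N)}
        if {(m + n, q) for m in M for n, q in N} == L:
            return M, N

# Distinct ends: the common prefix stays on the edge, the rest is shifted.
shifted = factorize({("ab", 1), ("ac", 2)})
# Same end: all the variation stays local, nothing is shifted.
local = factorize({("x", 1), ("y", 1)})
```

In the first case the result is M = {a} and N = {b 1, c 2}; in the second, M = {x, y} and N contains only the state 1 with an empty output.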


Figure 3.10. A generalized sequential transducer. 


Transducers for morphological analysis like those of Figs. 3.7-3.10 can be
used to produce transducers for morphological generation, by swapping input 
and output. The resulting transducer can be processed with the same methods 
as above in order to reduce nondeterminism. 

When observable forms and underlying lemmas are very different, the description
of morphology becomes complex. At the same time, it must still be
hand-crafted by linguists, which requires that it be made of simple, readable
parts, which are combined through some sort of compilation. For example, if
both morphological variations and combinatorial constraints are complex, they 
are better described separately. Combinatorial constraints between morpho- 
logical elements are described in an automaton at the underlying level, i.e. of 
lemmas and grammatical codes, as in Fig. 3.11. 


Figure 3.11. Combinatorial constraints between morphological elements. 


Morphological changes are described in a transducer, with input at the level 
of written text and output at the underlying level. This is done in Fig. 3.12, 
which is more complex than Fig. 3.8, but also more general: it allows for more 
combinations of suffixes, e.g. -ingly, which was not included in Fig. 3.8 because
it is not acceptable combined with remove. 

How can we use these two graphs for morphological analysis? There are two 
solutions. The simpler solution applies the two graphs separately. When we 
apply the transducer of Fig. 3.12 to a word, we obtain, in general, an automaton. 
The automaton has several paths if several analyses are possible, as with flatter. 
Then when we compute the intersection of this automaton with that of Fig. 3.11, 
this operation selects those analyses that obey the combinatorial constraints. 
The algorithm of intersection of finite automata is based on the principle that 
the set of states of the resulting automaton is the Cartesian product of the sets 
of states of the input automata. 
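The product construction behind this intersection can be sketched as follows. For brevity we assume deterministic automata given as (initial state, set of terminal states, transition dict); the automaton of analyses and the automaton of constraints below are toy data invented for the example, not those of Figs. 3.11 and 3.12.

```python
def intersect(a, b):
    """Product construction; only reachable pairs of states are built."""
    (i1, t1, d1), (i2, t2, d2) = a, b
    start = (i1, i2)
    delta, terminals = {}, set()
    stack, seen = [start], {start}
    while stack:
        p, q = stack.pop()
        if p in t1 and q in t2:
            terminals.add((p, q))
        for (origin, symbol), p2 in d1.items():
            if origin == p and (q, symbol) in d2:
                target = (p2, d2[(q, symbol)])
                delta[((p, q), symbol)] = target
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
    return start, terminals, delta

def accepts(automaton, word):
    state, terminals, delta = automaton
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in terminals

# Toy automaton of analyses of "flatter": two paths.
analyses = (0, {2, 6}, {
    (0, "flatter"): 1, (1, ".V"): 2,
    (0, "flat"): 3, (3, ".A"): 4, (4, "+er"): 5, (5, ".C"): 6,
})
# Toy constraint automaton (hypothetical): keeps the adjectival analysis only.
constraints = (0, {4}, {
    (0, "flat"): 1, (1, ".A"): 2, (2, "+er"): 3, (3, ".C"): 4,
})
product = intersect(analyses, constraints)
```

The states of the result are pairs of states, and a path survives only if both automata can follow it.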

A more elaborate solution consists in performing part of the computation 
in advance. The automaton of Fig. 3.11 and the transducer of Fig. 3.12 do not 
depend on input text; they can be combined into the transducer of Fig. 3.8. If 
the automaton recognizes a set L and the transducer realizes a relation R, the
operation consists in computing a transducer that realizes the relation R with
its output restricted to L. This can be implemented, for instance, by applying
algorithm COMPOSETRANSDUCERS (section 1.5) to the transducer of R and a
transducer realizing the identity of L. Note that this algorithm is a variant of
the algorithm of intersection of finite automata.

Figure 3.12. Morphological changes.

Morphological analysis and generation are not independent of the dictionary 
of stems: combinations of stems with affixes obey compatibility constraints, e.g. 
the verb fit does not combine with the suffix -able; stems undergo morphological 
variations, like remove in removable. Due to such dependencies, morphological 
analysis, in general, cannot be performed without vocabulary recognition. A 
dictionary of stems is manually created in the form of a list of many thousands 
of items and then compiled, so the interface with a transducer for morphologi- 
cal analysis requires practical organization. Combinatorial constraints between 
stems and affixes are represented by assigning marks to stems to indicate to 
which initial states of the automaton each stem must be connected. During 
compilation, the dictionary of stems and the automaton of combinatorial con- 
straints are combined into an automaton. Morphological variations of stems are 
taken into account in the transducer; if analogous stems behave differently in 
an unpredictable way, like fit/fitted and profit/profited, marks are assigned to 
stems and the transducer refers to these marks in its output. If these provisions 
are taken, the operation on the automaton of constraints and the transducer of 
variations can be performed as above and produces a satisfactory result.
In this case, the description is distributed over two data sets: an automaton
and a transducer, and the principle of the combination between them is that the 
automaton is interpreted as a restriction on the output part of the transducer. 

It is often convenient to structure manual description in the form of more 
than two separate data sets: for example, one for the final e of verbs like remove, 
another for the final e of -able, another for variations between the forms -ly, -ally,
-y of the adverbial suffix etc. This strategy can be implemented in three ways, 
depending on the formal principle adopted to combine the different elements 
of description: composition of transductions, intersection of transducers, and 
commutative product of bimachines. 


3.1.6. Composition of transductions 


The simplest of these three techniques involves the composition of transductions. 
Specialists in language processing usually refer to this operation by the bucolic 
term “cascade”. The principle is simple. The data for morphological analysis or 
generation consists of a specification of a transduction between input strings and 
output strings. This transduction can be specified with several transducers. The 
first transducer is applied to input strings, the next transducer to the output of 
the first, and so on. The global transduction is defined as the composition of 
all the transductions realized by the respective transducers. 
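The principle of a cascade can be sketched by modelling each transduction as a function from a string to a set of strings. The two stages below, mark_suffix and restore_e, are toy rewrites invented for this example; they are not the transducers of the figures.

```python
from functools import reduce

def cascade(*stages):
    """Compose transductions given as functions str -> set of str."""
    def run(word):
        return reduce(
            lambda current, stage: {out for w in current for out in stage(w)},
            stages,
            {word},
        )
    return run

def mark_suffix(word):
    """Toy stage: insert a boundary before a final -ed."""
    return {word[:-2] + "#ed"} if word.endswith("ed") else {word}

def restore_e(word):
    """Toy stage: optionally restore a final e before the boundary."""
    return {word, word.replace("#", "e#")} if "#" in word else {word}

in_order = cascade(mark_suffix, restore_e)("removed")
swapped = cascade(restore_e, mark_suffix)("removed")
```

With the stages in this order the cascade yields both remov#ed and remove#ed; with the stages swapped, restore_e finds no boundary to work on and only remov#ed is produced, which shows that the order of the stages matters.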

For example, Fig. 3.8 is equivalent to the composition of the transductions
specified by Figs. 3.13-3.16. Fig. 3.13 delimits and tags morphological elements,
but does not substitute canonical forms for variants. Fig. 3.14 inserts the final
e of the canonical form of remove. In Fig. 3.14, the input label @ stands for
a default input symbol: it matches the next input symbol if, at this point of
the transducer, no other symbol matches. The output label @ means an output
symbol identical to the corresponding input symbol. Fig. 3.15 inserts the final
e of the canonical form of -able. Fig. 3.16 assigns the canonical form to the
variants of the adverbial suffix -ly.

Figure 3.13. A cascade: first transducer.

Figure 3.14. A cascade: second transducer.

Figure 3.15. A cascade: third transducer.

Figure 3.16. A cascade: fourth transducer.

During the application of a transducer, the input string is segmented according
to the input labels of the transducer, and the output string is a concatenation
of output labels. When transducers are applied as a cascade, the segmentation 
of the output string of a transducer is not necessarily identical to the segmen- 
tation induced by the application of the next. The global transduction is not 
changed if we modify the alignment of one of the transducers, provided that it 
realizes the same transduction. 

As an alternative to applying several transducers in sequence, one can pre- 
compute an equivalent transducer by algorithm COMPOSETRANSDUCERS, but 
the application of the resulting transducer is not necessarily quicker, depending 
on the number, size and features of the original transducers. 

The principle of composition of rules was implemented for the first time in... 
the 5th century B.C., in Panini’s Sanskrit grammar, in order to define Sanskrit 
spelling, given that the form of each element depends on its neighbors. 

Composition of relations is not a commutative operation. In our example of a 
cascade, the transductions of Figs. 3.14-3.16 can be permuted without changing 
the result of the composition, but they must be applied after Fig. 3.13, because 
they use the boundaries of morphological elements in their input, and these 
boundaries are inserted by the transduction of Fig. 3.13. In general, simple 
transductions read and write only in a few regions of a string, but interactions 
between different transductions are observed when they happen to read or write 
in the same region. 

The principle of defining a few levels in a determined order between the 
global input level and the global output level is often natural and convenient. 
The alphabet of each intermediate level is a subset of $A \cup B$. In morphological
generation, the level of underlying morphological elements may have something 
to do with a previous state of the language, the sequence of levels being con- 
nected to successive periods of time in the history of language changes. 

However, in a language with complex morphological variations represented 
by dozens of rules, the exclusive use of composition involves dozens of ordered 
levels. This complicates the task of the linguist, because he has to form a mental 
image of each level and of their ordering. 

Intuitively, when two morphological rules are sufficiently simple and unre- 
lated, one feels that it should be possible to implement them independently, 
without even determining in which order they apply: hence the term “simul- 
taneous combination”. In spite of this intuition, rules cannot be formalized 
without specifying how they are interpreted in case of an overlap between the 
application sites of several rules (or even of the same rule): if rules apply to 
two sites uv and vw, the value of v taken into account for vw can involve the
input or the output level, or both. Various formal ways of combining formal 
rules have been investigated. Two main forms of simultaneous combination are 
presently in use.


3.1.7. Intersection of transducers


The intersection of finite transducers can be used to specify and implement mor- 
phological analysis and generation. The alignment between input and output 
strings is an essential element of this model. This alignment must be literal, 
i.e. each individual input or output symbol must be aligned either with a single 
symbol or with $\varepsilon$. Several alignments are usually acceptable, e.g.

[display: two alternative alignments of the same input/output pair]

but one must be chosen arbitrarily.

Formally, an alignment over A and B is a subset of the free monoid $X^*$,
where X is a finite subset of $A^* \times B^*$. An alignment is literal if it is a subset of
$((A \mid \varepsilon) \times (B \mid \varepsilon))^*$.

The alignment is determined in order to specify explicitly the set of all pairs
$(u:v) \in (A \mid \varepsilon) \times (B \mid \varepsilon)$ that will be allowed in aligned input/output pairs for all
words of the language. Since all elements in the alignment will be concatenations
of elements in this set, we can call it X. In the English example above, this set
can comprise letters copied to output, i.e. pairs $(x:x)$ where x is a letter,
plus a few insertions:

$(\varepsilon:\#)$, $(\varepsilon:e)$, $(\varepsilon:.V)$, $(\varepsilon:.TM)$, $(\varepsilon:.A)$, $(\varepsilon:.ADV)$

and two deletions of letters:

$(a:\varepsilon)$, $(l:\varepsilon)$


The set of aligned input/output pairs for all words of the language is viewed as 
a language over the alphabet X. This language is specified as the intersection 
of several regular languages. Each of these languages expresses a constraint 
that all input/output pairs must obey, and the intersection of the languages 
is the set of pairs that obey simultaneously all the constraints. Since these 
regular languages share the same alphabet $X \subset A^* \times B^*$, they can be specified
by transducers over A and B. For example, the transducers in Figs. 3.17-3.20
specify necessary conditions of occurrence for some of the elements of X. In
Fig. 3.17, the label @ denotes a default symbol. It matches the next member
of X if and only if no other label explicitly present at this point of the graph
does. One of the states has no outgoing edge and is not terminal: it is a sink
state which is used to rule out the occurrence of $(\varepsilon:\#)$ when it is not preceded
by $(\varepsilon:.A)$ or $(\varepsilon:.V)$.

Figure 3.17. Conditions of occurrence of $(\varepsilon:\#)$.

Figure 3.18. Conditions of occurrence of $(\varepsilon:e)$.

Figure 3.20. Conditions of occurrence of $(a:\varepsilon)$ and $(l:\varepsilon)$.

In order to be complete, we should add transducers to specify the conditions
of occurrence of $(\varepsilon:.V)$, $(\varepsilon:.TM)$, $(\varepsilon:.A)$ and $(\varepsilon:.ADV)$.

The intersection of transducers is computed with the algorithm of intersec- 
tion of automata, considering transducers as automata over X. The resulting 
transducer checks all the constraints simultaneously. This operation of inter- 
section of transducers is equivalent to the intersection of languages in the free 
monoid X*, but not to the intersection of relations in A* x B*, because the 
intersection of relations does not take into account alignment. (In addition, an 
intersection of regular relations is not necessarily regular.) 
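The difference between the two kinds of intersection can be checked on a minimal example. The encoding is an assumption made here: a word over X is a tuple of (input, output) pairs, with the empty string standing for the empty word; the two one-word languages are invented for the illustration.

```python
EPS = ""

def relation_pair(alignment):
    """Project an aligned word over X onto a pair (input word, output word)."""
    return ("".join(a for a, _ in alignment), "".join(b for _, b in alignment))

# Two languages over X, each containing one aligned word.
lang1 = {(("a", EPS), (EPS, "b"))}      # delete a, then insert b
lang2 = {((EPS, "b"), ("a", EPS))}      # insert b, then delete a

# Intersection of languages over X: alignments must coincide.
aligned_intersection = lang1 & lang2

# Intersection of relations in A* x B*: only the projected pairs matter.
relation_intersection = (
    {relation_pair(w) for w in lang1} & {relation_pair(w) for w in lang2}
)
```

Both languages realize the pair (a, b), so the intersection of the relations contains it, whereas the intersection over X is empty because the two alignments differ.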

As opposed to the framework of composition of transductions, all the trans- 
ducers describe correspondences between the same input level and the same 
output level. This is why this model is called “two-level morphology”. Composi- 
tion of transductions and intersection of transducers are orthogonal formalisms, 
and they can be combined: several batches of two-level rules are composed in a 
definite order. 

Two-level constraints expressed as transducers are hardly readable, and ex- 
pressing them as regular expressions over X would be even more difficult and 
error-prone. In order to solve this problem of readability, specialists in two- 
level morphology have designed an additional level of compilation. Rules are 
expressed in a special formalism and compiled into transducers. These transducers
are then intersected together. The formalism of expression of two-level
rules involves logical operations and regular expressions over X. For example,
the following rule is equivalent to Fig. 3.17:

[display: two-level rule for $(\varepsilon:\#)$, with its left and right contexts]

This type of rule is more readable than a transducer, because it is structured
in three separate parts: the symbol involved in the rule, here $(\varepsilon:\#)$, the left
context (before the underscore), and the right context.

In this model, input and output are completely symmetrical: the same de- 
scription is adapted for morphological analysis and generation. 


3.1.8. Commutative product of bimachines 


A bimachine is structured in three parts: 

• a description of the left context required for the rule to apply,

• a similar description of the right context, and

• a mapping table that specifies a context-dependent mapping of input symbols
  to output symbols.

As opposed to two-level rules, left and right context are described only at input 
level. Fig. 3.21 is a representation of a bimachine that generates the variant 
-ally of the adverbial suffix -ly in emphatically. 




Figure 3.21. Bimachine generating the variant -ally of the adverbial suffix -ly. 


In this figure, the automaton on the left represents the left context and 
recognizes occurrences of the sequence ic.A. Whenever this sequence occurs, 
the automaton enters state 3. In the automaton, the label @ represents a
default symbol: it matches the next input symbol if no other label at this
point of the automaton does. The automaton on the right similarly recognizes
occurrences of ly.ADV, but from right to left. Whenever this sequence occurs,
the automaton enters state 4. The table specifies the mapping of input symbols
to output symbols. The alphabets A and B have a nonempty common subset.
In the table, @:@ represents a default mapping: any input symbol not explicitly
specified in the table is mapped onto itself. The symbol # is mapped to al when
its left and right context is such that the respective automata are in states 3
and 4, i.e. when it is preceded by ic.A and followed by ly.ADV. Other symbols
in such a context, and all symbols in other contexts, are copied to output.
Thus, the bimachine maps occurrences of ic.A#ly.ADV to ic.Aally.ADV and
leaves everything else unchanged. The input/output alignment that underlies
the bimachine is always input-wise synchronous.

Formally, a bimachine over alphabets A and B is defined by


• two deterministic automata over A; let $\overrightarrow{Q}$ and $\overleftarrow{Q}$ be the sets of states of the
  two automata; the distinction between terminal vs. non-terminal states is
  not significant;

• a function $\gamma : \overrightarrow{Q} \times A \times \overleftarrow{Q} \to B^*$, which is equivalent to the mapping table
  in Fig. 3.21.

The transduction realized by a bimachine is defined as follows. One performs
a search in the left automaton controlled by the input word $u = u_1 u_2 \cdots u_n$. If
this search is possible right until the end of the word, a sequence $\overrightarrow{q}_0 \overrightarrow{q}_1 \cdots \overrightarrow{q}_n$
of states of the left automaton is encountered, where $\overrightarrow{q}_0$ is the initial state. A
similar search in the right automaton is controlled by $u_n \cdots u_2 u_1$. If the search
can be completed too, states $\overleftarrow{q}_n \cdots \overleftarrow{q}_1 \overleftarrow{q}_0$ of the right automaton are encountered,
where $\overleftarrow{q}_n$ is the initial state.

The output string for the symbol $u_i$ of u is $\gamma(\overrightarrow{q}_{i-1}, u_i, \overleftarrow{q}_i)$ and the output for
u is the concatenation of these output strings. If one of the searches could not
be completed, or if one of the output strings for the letters is undefined, then
the output for u is undefined.
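This definition can be turned into a short program. The sketch below rests on several assumptions of ours: symbols are tokens such as ".A" or "#"; each context automaton is built naively as a complete automaton whose state is the length of the longest prefix of the pattern read so far, so that both contexts are recognized in state 3 here rather than in the states numbered in the figure; the mapping function is total, so the undefined cases of the definition do not arise.

```python
def context_automaton(pattern):
    """Transition function of a deterministic automaton for A* pattern.

    The state is the length of the longest prefix of `pattern` that is a
    suffix of the symbols read so far.
    """
    pattern = list(pattern)
    def step(state, symbol):
        read = pattern[:state] + [symbol]
        for k in range(min(len(read), len(pattern)), 0, -1):
            if read[-k:] == pattern[:k]:
                return k
        return 0
    return step

def run(step, symbols):
    """Return the states visited while reading symbols, starting in state 0."""
    states = [0]
    for s in symbols:
        states.append(step(states[-1], s))
    return states

def apply_bimachine(left_step, right_step, gamma, word):
    n = len(word)
    left = run(left_step, word)                 # left[i]: after u_1 .. u_i
    right = run(right_step, word[::-1])[::-1]   # right[i]: after u_n .. u_(i+1)
    return [gamma(left[i - 1], word[i - 1], right[i]) for i in range(1, n + 1)]

left_step = context_automaton(["i", "c", ".A"])
right_step = context_automaton([".ADV", "y", "l"])   # ly.ADV read right to left

def gamma(left_state, symbol, right_state):
    """Map # to al between ic.A and ly.ADV; copy everything else."""
    if symbol == "#" and left_state == 3 and right_state == 3:
        return "al"
    return symbol

word = ["i", "c", ".A", "#", "l", "y", ".ADV"]
output = "".join(apply_bimachine(left_step, right_step, gamma, word))
```

The right automaton is run on the reversed word and its sequence of states is reversed again, so that the state paired with each symbol summarizes everything to its right.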

A transduction is realized by a bimachine if and only if it is regular and a 
function. 

The use of bimachines for specifying and implementing morphological anal- 
ysis or generation requires that they can be combined to form complete descrip- 
tions. In the mapping table of Fig. 3.21, the default pair @:@ occurs in all four 
cases; the bimachine specifies an output string for some occurrences of #, and 
copies all other occurrences of input symbols. We will say that the bimachine 
“applies” to these occurrences of #, and “does not apply” to other occurrences 
of input symbols. In morphology, separate rules belonging to the same descrip- 
tion are complementary in so far as they do not “apply” to the same occurrences 
of input symbols. This idea can be used to define a notion of combination of 
bimachines over the same alphabets A and B. 

Formally, we say that a bimachine “applies” to an input symbol a in a
given context, represented by two states $\overrightarrow{q}$ and $\overleftarrow{q}$, if and only if $\gamma(\overrightarrow{q}, a, \overleftarrow{q})$
either is undefined or is not equal to a. It “does not apply” if and only if
$\gamma(\overrightarrow{q}, a, \overleftarrow{q}) = a$. If two bimachines never apply to the same symbol in the same
input sequence, a new bimachine over the same input and output alphabets A 
and B can be defined so that the output for a given input symbol is specified 
by the bimachine that applies. The output is a copy of the input symbol if 
none of them applies. (Each automaton of the new bimachine is constructed 
from the corresponding automata of the two bimachines, with the algorithm of 
intersection of automata.) This operation on bimachines is commutative and 
associative; its neutral element is a bimachine that realizes the identity of A. 
We call this operation “commutative product”. 

The commutative product of a finite number of bimachines is defined if and 
only if it is defined for any two of them. 
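At the level of the mapping functions, the commutative product can be sketched as follows. We assume here that the context automata of the product run the automata of both bimachines in parallel, so that a context state is a pair of states, and that both mapping functions are total; the two rules g1 and g2 are toy examples.

```python
def commutative_product(gamma1, gamma2):
    """Combine two mapping functions over pairs of context states.

    A rule "applies" when its output differs from the input symbol.
    """
    def gamma(left_pair, symbol, right_pair):
        out1 = gamma1(left_pair[0], symbol, right_pair[0])
        out2 = gamma2(left_pair[1], symbol, right_pair[1])
        applies1, applies2 = out1 != symbol, out2 != symbol
        if applies1 and applies2:
            raise ValueError("both rules apply: the product is undefined")
        # If neither rule applies, out2 is a copy of the input symbol.
        return out1 if applies1 else out2
    return gamma

# Toy rules (hypothetical state numbers).
g1 = lambda l, a, r: "al" if (l, a, r) == (3, "#", 3) else a
g2 = lambda l, a, r: "E" if (l, a, r) == (1, "e", 0) else a
g = commutative_product(g1, g2)
```

Swapping the two arguments, together with the components of the state pairs, gives the same outputs, which reflects the commutativity of the operation.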

With this operation, linguists can manually construct separate bimachines, 
or rules, and combine them. These manually constructed rules must also be 
readable. This can be achieved by ensuring that the rules are presented accord- 
ing to the following conventions and have the following properties. 

• Final states are specified in the two automata. The content of the mapping
  table does not depend on the particular states reached when exploring the
  context, but only on whether these states are terminal or not. For example,
  in Fig. 3.21, states 3 and 4 would be specified as terminal.

• In the mapping table, whenever at least one of the two states representing
  the context is non-terminal, input symbols are automatically copied
  to output, as in Fig. 3.21. When both states are terminal, only the input/output
  pairs for which the output string is different from the input
  symbol are specified. Let I be the set of input symbols that occur in the
  input part of these pairs: if both states are terminal and the input symbol
  is in I, the rule applies; otherwise, it does not apply and input is copied
  to output.

• The languages recognized by the two automata are of the form A*L and
  A*R, as in Fig. 3.21. Therefore, it suffices to specify L and R; automata
  for A*L and A*R can be automatically computed. In addition, the mirror
  image of R is specified instead of R itself, for the sake of readability.

The bimachine of Fig. 3.21 has these properties and is represented with these
conventions in Fig. 3.22.


Figure 3.22. The bimachine of Fig. 3.21 with the conventions for man- 
ually constructed rules. 


This figure represents L, R and the input/output pairs for which the rule 
applies. These three parts are separated by the states labeled A. 

The commutative product of two rules is defined if and only if $A^*L_1 \cap A^*L_2$,
$A^*R_1 \cap A^*R_2$ and $I_1 \cap I_2$ are not simultaneously nonempty. This condition is
tested automatically on all pairs in a set of rules written to be combined by
commutative product. If the three intersections are simultaneously nonempty 
for a pair of rules, the linguist is provided with the set of left contexts, right 
contexts and input symbols for which the two rules conflict, and he/she can 
modify them in order to resolve the conflict. (A hierarchy or priorities between 
rules would theoretically be possible but would probably make the system more 
complex and its maintenance more difficult.) 

The advantages of bimachines for specifying and implementing morphologi- 
cal analysis and generation are their readability and the fact that only differences 
between input and output need to be specified. 

Bimachines are equivalent to regular word functions and, in principle, cannot 
represent ambiguous transitions. They have to be adapted in order to allow for 
limited variations in output. Take, for example, the generation of the preterite 
of dream: for a unique underlying form, dream.V#ed.TM, where #ed.TM is
an underlying tense/mood suffix, there are two written variants: dreamed and 
dreamt. Such variations are limited; in agglutinative languages, they can oc- 
cur at any point of a word, not necessarily just at the end. This problem is 
easily solved in the same way as we did for minimizing ambiguous transducers 
in section 3.1.5: by composition with finite substitutions. Bimachines realize 
transductions; several of these transductions can be composed in a definite order 
together or with finite substitutions. 

In the example of dream.V#ed.TM, the two variants can be generated by
introducing 3 new symbols 1, 2 and 3, and

• a bimachine that produces dream.V#1ed.TM,

• a finite substitution producing dream.V#2ed.TM and dream.V#3ed.TM,
  and

• a second bimachine that outputs dreamed for dream.V#2ed.TM and the
  variant dreamt for dream.V#3ed.TM.
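The three steps can be mimicked with plain string functions. This is only a sketch: mark and spell stand in for the two bimachines and are hard-coded for this single word, and all three function names are ours.

```python
def mark(word):
    """Stands in for the first bimachine: insert the new symbol 1."""
    return {word.replace("dream.V#ed.TM", "dream.V#1ed.TM")}

def substitute(word):
    """Finite substitution: 1 is replaced by 2 or by 3."""
    if "1" not in word:
        return {word}
    return {word.replace("1", "2"), word.replace("1", "3")}

def spell(word):
    """Stands in for the second bimachine: one spelling per marked form."""
    table = {"dream.V#2ed.TM": "dreamed", "dream.V#3ed.TM": "dreamt"}
    return {table.get(word, word)}

def generate(word):
    results = {word}
    for stage in (mark, substitute, spell):
        results = {out for w in results for out in stage(w)}
    return results

variants = generate("dream.V#ed.TM")
```

Each bimachine stage remains a function; the only step that multiplies the number of results is the finite substitution.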

However, a bimachine is an essentially deterministic formalism. It is ade- 
quate for the direct description of morphological generation, because the under- 
lying level is more informative and less ambiguous than the level of written text: 
thus, for an input string at the level of underlying morphological elements, there 
will often be a unique output string or limited variations in output. For instance, 
flatter has two representations at the underlying level, but one spelling. 

It is possible to do morphological analysis with bimachines, but one has 
to carry out linguistic description for morphological generation, and automati- 
cally derive morphological analysis from it. The method consists in compiling 
each bimachine (or commutative product of bimachines) into a transducer, and 
swapping input and output in the transducer. During the compilation of a bi- 
machine into a transducer, the set of states of the transducer is constructed as 
the Cartesian product of the sets of states of the two automata. 


3.1.9. Phonetic variations 


Morphological analysis and generation of written text have an equivalent for 
speech: analysis and generation of phonetic forms. Phonetic forms are represented
by strings of phonetic symbols. They describe how words are pronounced,
taking into account contextual variants and free variants. An example of contex- 
tual phonetic variation in British English is the pronunciation of more, with r in 
more ice and without in more tea. Free variation is exemplified by can which can 
be either stressed or reduced in He can see. The input of analysis is thus a pho- 
netic representation of speech. The output is some underlying representation of 
pronunciation, which is either conventional spelling, or a specific representation 
if additional information is needed, such as grammatical information. 

The analysis of phonetic forms is useful for speech recognition. Their gen- 
eration is useful for speech synthesis. A combination of both is a method for 
spelling correction: generate the pronunciation(s) of a misspelled word, then 
analyze the phonetic forms obtained. 

A difference between phonetic processing and morphological processing is 
that a text can usually be pronounced in many ways, whereas spelling is much 
more standardized. In other aspects, the analysis and generation of phonetic 
forms is similar to morphological analysis and generation. The computational 
notions and tools involved are essentially the same. 

The complexity of the task depends on the writing systems of languages. 
When all information needed to deduce phonetic strings, including informa- 
tion about phonetic variants, is encoded in spelling, then phonetic forms can 
be derived from written text without any recognition of the vocabulary. This 
is approximately the case of Spanish. Most Spanish words can be converted 
to phonetic strings by transducers, two-level rules or bimachines that do not 
comprise lexical information. Fig. 3.23 converts the letter c into the phonetic 
symbol $\theta$ before the vowels e and i.


Figure 3.23. A phonetic conversion rule in Spanish. 


In most other languages, spelling is ambiguous: the pronunciation of a
sequence of letters depends on the word in which it occurs in an unpredictable 
way. For example, ea between consonants is pronounced differently in bead, 
head, beatific, creation, react; in read, the pronunciation depends on the gram- 
matical tense of the verb; in lead, it depends on the part of speech of the word: 
noun or verb. Due to such dependencies, which are most frequent in English 
and in French, phonetic forms cannot be generated from written texts accurately 
without vocabulary recognition. In other words, phonetic conversion requires a 
dictionary, which can be implemented in the form of a transducer and adapted 
for quick lookup into a generalized sequential transducer like that of Fig. 3.10. 
However, even in languages with a disorderly writing system like English 
or French, the construction of such a dictionary can be partially automated. 
Transducers, two-level rules or bimachines can be used to produce tentative 
phonetic forms which have to be reviewed and validated or corrected by linguists. 

A transducer that recognizes the vocabulary of a language is larger than a 
transducer that does not. They also differ in the way they delete word bound- 
aries. In many languages, words are delimited in written text; they are not in 
phonetic strings, because speech is continuous and there is no audible evidence 
that a word ends and the next begins. In a transducer that recognizes the 
vocabulary, edges that delete word boundaries, e.g. edges labelled (#:ε), can 
be associated with ends of words. When the transducer is reversed by swap- 
ping input and output, the resulting transducer not only converts phonetics 
into spelling but also delimits words. The same cannot be done in a transducer 
that does not recognize vocabulary: since certain edge(s) erase word boundaries 
independently of context, the reversed transducer will generate optional word 
boundaries everywhere. 

Phonetic strings are usually very ambiguous, and the result of their analysis 
consists of several hypotheses with different word delimitation, as in Fig. 3.24. 


Figure 3.24. Acyclic automaton of the analyses of a phonetic form. 


The result of the analysis of ambiguous input is naturally represented in 
an acyclic automaton like that of Fig. 3.24. We will call it an automaton of 
analyses, because it represents a set of mutually exclusive analyses. In language 
engineering, most specialists call such an automaton a “lattice”⁵. The output of 
a purely acoustic-to-phonetic phase of speech recognition is also an automaton 
of analyses: a segment of speech signal, i.e. the equivalent of a vowel or a 
consonant in acoustic signal, cannot always be definitely identified as a single 
phone (phonetic segment). 
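
The construction of an automaton of analyses from a phonetic string can be sketched as follows. This is an illustrative sketch under strong assumptions: the phonetic notation is an ad hoc ASCII transcription, the four-entry lexicon is invented for the example, and the names build_analyses and paths are hypothetical. States are the positions in the phonetic string; an edge is added wherever a lexicon entry matches, and edges that lie on no complete path are trimmed.

```python
def build_analyses(phones: str, lexicon: dict[str, list[str]]):
    """Return edges (start, end, word) of the acyclic automaton of analyses."""
    n = len(phones)
    edges = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            for word in lexicon.get(phones[i:j], []):
                edges.append((i, j, word))
    # Forward pass: states reachable from state 0.
    reach_fwd = {0}
    for i, j, _ in sorted(edges):
        if i in reach_fwd:
            reach_fwd.add(j)
    # Backward pass: states from which the final state n can be reached.
    reach_bwd = {n}
    for i, j, _ in sorted(edges, reverse=True):
        if j in reach_bwd:
            reach_bwd.add(i)
    return [(i, j, w) for i, j, w in edges if i in reach_fwd and j in reach_bwd]


def paths(edges, state, final):
    """Enumerate the word sequences labelling paths from state to final."""
    if state == final:
        yield []
    for i, j, w in edges:
        if i == state:
            for rest in paths(edges, j, final):
                yield [w] + rest


# Toy lexicon (assumption): phonetic form -> spellings.
LEXICON = {"aI": ["I", "eye"], "aIs": ["ice"], "skrim": ["scream"],
           "krim": ["cream"], "kri": ["Cree"]}
edges = build_analyses("aIskrim", LEXICON)
analyses = [" ".join(p) for p in paths(edges, 0, len("aIskrim"))]
```

With this toy lexicon the hypotheses differ in word delimitation (ice cream versus I scream), and the entry Cree is discarded because no word can complete it.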


⁵ This term has a precise mathematical meaning: an ordered set where each pair has a 
greatest lower bound and a least upper bound. As a matter of fact, in an acyclic graph, edges 
induce an ordering among the set of states. But the ordered set of states of an acyclic graph is 
not necessarily a lattice in the mathematical sense. In the acyclic automaton of Fig. 3.24, for 
instance, cut has no greatest lower bound and new has no least upper bound. Consequently 
we will avoid using the term “lattice” for denoting automata of analyses. 


3.1.10. Weighted automata 


The notions of automata and transducers exemplified in the preceding sections 
can be extended to weighted automata and transducers. In a weighted automa- 
ton, each transition has a weight which is an element of a semiring K; the set of 
terminal states is replaced with a terminal weight function from the set of states 
to K. The weight of a path is the product of the weights of its transitions. A 
Markov chain is a particular case of a weighted automaton. 

In such models, weights approximate probabilities of occurrence of symbols 
in certain contexts, and the semiring is often ℝ⁺. For example, in an automaton 
of analyses which contains phones recognized in a speech signal, weights can be 
assigned to each transition in order to represent the plausibility of the phone 
given the acoustic signal. The weighted automaton is exploited by selecting the 
path that maximizes the product of the weights. 
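
Selecting the path that maximizes the product of the weights can be sketched for an acyclic weighted automaton as below. The sketch assumes that states are numbered so that every transition goes from a lower to a higher number (a topological numbering); the function name best_path and the toy weights are hypothetical.

```python
def best_path(edges, start, final):
    """edges: (source, target, label, weight) tuples with source < target.

    Returns (weight, labels) of the path maximizing the product of weights,
    or None if the final state is unreachable.
    """
    best = {start: (1.0, [])}
    # Processing edges by increasing source guarantees that the best score
    # of a state is known before its outgoing edges are examined.
    for src, dst, label, w in sorted(edges, key=lambda e: e[0]):
        if src not in best:
            continue
        score, labels = best[src]
        cand = (score * w, labels + [label])
        if dst not in best or cand[0] > best[dst][0]:
            best[dst] = cand
    return best.get(final)


# Toy automaton of analyses over phones (weights are invented plausibilities).
PHONES = [(0, 1, "b", 0.6), (0, 1, "p", 0.4),
          (1, 2, "i", 0.9), (1, 2, "e", 0.1),
          (2, 3, "d", 0.7), (2, 3, "t", 0.3)]
weight, phones = best_path(PHONES, 0, 3)
```

Because the automaton is acyclic, a single pass over the transitions is enough; no path needs to be enumerated explicitly.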

Another example can be derived from Fig. 3.11: the plausibility of occurrence 
of a morphological element after a given left context could be added to this figure 
by assigning weights to boxes. The only known method of setting the value of 
these weights is based on statistics about occurrences of symbols or sequences 
in a sample of texts, a learning corpus. 

Weighted automata are also used to compensate for the lack of accurate 
linguistic data. Weights are assigned to transitions as a function of observable 
hints as to the occurrence of specific linguistic elements. During the analysis of 
a text, the weights are used to recognize those elements. For example, an initial 
uppercase letter is a hint of a proper name; the word ending -ly is a hint of an 
adverb like shyly. Weights are derived from statistics computed in a learning 
corpus. Results are inferior to those obtained with word lists of sufficient lexical 
coverage, e.g. lists of proper names or of adverbs: for instance, bodily ends in 
-ly but is usually an adjective. Word lists tend to be more and more used, but 
the two approaches are complementary, and the weighted-automaton method 
can make systems more robust when sufficiently extensive word lists are not 
available. 


3.2. From words to sentences 


3.2.1. Engineering approaches 


The simplest model of the meaning of a text is the “word bag” model. Each 
word in the text represents an element of meaning, and the meaning of the text 
is represented by the set of the words that occur at least once in the text. The 
number of occurrences is usually attached to each word. The “word bag” model 
is used to perform tasks like content-based classification and indexation. 
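
A minimal sketch of the “word bag” model, assuming a deliberately crude tokenization (lowercased runs of ASCII letters); the function name word_bag is hypothetical.

```python
import re
from collections import Counter


def word_bag(text: str) -> Counter:
    """Map each word of the text to its number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


bag = word_bag("The fly flies and the flies fly")
```

Since only counts are kept, two texts made of the same words in a different order receive the same representation.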

In order to implement the same tasks in a more elaborate way, or to im- 
plement other tasks, the sequential order of words must be taken into account. 
Translation is an example of an operation for which word order is obviously 
relevant: in many target languages, The fly flies and The flies fly should be 
translated differently. A model of text for which not only the value of words, 
but also their order, is relevant can be called a syntactic model. The formal and 
algorithmic tools involved in such a model depend entirely on the form of the 
linguistic data required. The most rational approach consists in constructing 
and using data similar to those mentioned in sections 3.1.4 to 3.1.9, but spec- 
ifying ordered combinations of words. These data take the form of manually 
constructed lists or automata; some of them are automatically compiled into 
forms more adapted to computational operations. This approach is a long-term 
one. The stage of manual construction of linguistic data implies even more skill 
and effort than in the examples of section 3.1 (From letters to words), basically 
because there are many more words than letters. In addition, engineers feel 
uneasy with such data, that are largely outside their domain of competence; 
linguists feel uneasy with the necessary formal encoding; and little of the task 
can be automated. A consequence of this situation is a lack of linguistic re- 
sources that has been widely recognized, since 1990, as a major bottleneck in 
the development of language processing. 

In order to avoid such work, alternative engineering techniques have been 
implemented and have had a dramatic development in recent years. The com- 
monest of these techniques rely on weighted automata. (They are the most pop- 
ular techniques based on weighted automata in language processing.) Weighted 
automata can be used to approximate various aspects of the grammar and syn- 
tax of languages: they can, for instance, guess at the part of speech of a word if 
the parts of speech of neighboring words are known. Weights are automatically 
derived from statistics about occurrences of symbols or sequences in a sample of 
texts, the learning corpus. The idea is similar to that with adverbs in -ly in sec- 
tion 3.1.10, but works even less well, for the same reason: there are more words 
than letters; there is a higher degree of complexity. As a matter of fact, in com- 
plex applications like translation and continuous speech recognition, results are 
still disappointing. Algorithms are well-known, but weights must be learnt for 
all words, and the only way of obtaining weights producing satisfactory results 
implies 

• numerous occurrences of each word; therefore very large learning corpora 
  (cf. section 3.1.1 about Zipf’s law), 

• statistics about sufficiently large contexts, 

• sufficiently fine-grained tag sets. 

The first constraint correctly predicts that if the learning corpus is too small, 
results are inadequate. When the size of the learning corpus increases, perfor- 
mances usually reach a maximum which is the best possible approximation in 
this framework. The last two constraints would lead to an explosion of the size 
of weighted automata and computational complexity. In practice, implementa- 
tions of this method require considerable simplification of fundamental objects 
of the model: there is no serious attempt at processing compound words or 
ambiguity; the size of contexts is limited to two words to the left, and the size 
of tag sets to a few dozen tags, which is smaller than the tag set of Fig. 3.3. 
Finally, taking into consideration the third constraint would increase the cost of 
the manual tagging of the learning corpus, or require resorting to automatic 
tagging, with a corresponding output of inferior quality. 

Resorting to such statistical approximations of grammar, syntax and the 
lexicon of languages is natural in so far as sufficiently accurate and comprehen- 
sive data seem out of reach. However, this is a short-term approach: it does not 
contribute to the enhancement of knowledge in these areas, and the technologies 
required for gathering exploitable and maintainable linguistic data have little 
in common with example-based learning. We can draw a parallel with mete- 
orology: future weather depends on future physical data, or on physical data 
all around the world, including in marine areas where they are not measured 
with sufficient accuracy and frequency. Thus, weather is forecast on the basis 
of statistics about examples of past observations. However, designing weather 
forecast programs does not contribute to the advance of thermodynamics. 

We will now turn to the linguistic approach. In order to relate formal notions 
with applications, we will refer primarily to translation, which is not a success- 
fully automated operation yet, but which involves many of the basic operations 
in language processing. 


3.2.2. Pattern definition and matching 


Defining and matching patterns are two of these basic operations. In order 
to be able to translate a technical term like microwave oven, we must have 
a description of it, a method to locate occurrences in texts, and a link to a 
translation. The methods of description and location of such linguistic forms 
must take into account the existence of variants like the plural, microwave ovens, 
and possibly abbreviations like MWO if they are in use in relevant source texts. 
Thus, many linguistic forms are in fact sets of variants, and the actual form of 
all variants cannot always be computed from a canonical form. For example, the 
abbreviation MWO cannot be predicted from microwave oven by capitalizing 
initials, which would yield MO; the equivalence between MWO and the full 
form cannot be automatically inferred, even if the acronym occurs in a sample 
of source texts, because an explicit link between them, like microwave oven 
(MWO), may be absent and, if present, would be ambiguous; etc. Thus the set 
of equivalent variants must often be manually constructed by linguists who are 
familiar with the field — a category of population which is often hard to find. 

We can associate in a natural way microwave oven and its variants in the 
finite automaton of Fig. 3.25. When several lines are included in the same state, 
like oven and ovens here, they label parallel paths. 

This type of automaton is more usual when there are more variants than with 
microwave oven. It is also used when the forms described are not equivalent, but 
constitute a small system which follows specific rules instead of general grammar 
rules of the language (Fig. 3.26). Such a system is called a local grammar. 

In very restricted domains, the vocabulary and the syntactic constructions 
used in actual texts can be so stereotyped that all variability can be described 
in this form. This is the case of short stock exchange reports, weather forecast 
reports, sport scores etc. Local grammars can be used for translation, but this 
implies linking two monolingual local grammars together, one for the source 
language and another for the target language. Individual phrases of a grammar 


Version June 23, 2004 




Figure 3.26. A local grammar. 


must be specifically linked with phrases of the other, because they are not 
equivalent. 

Finite automata defining linguistic patterns can be used to locate occur- 
rences of the patterns in texts. When automata are as small as in the pre- 
ceding instances, simple algorithms are sufficient: automata are compiled into 
the more traditional format with labelled edges and numbered states; they are 
determinized; they are matched against each point of the text. 
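
The last step, matching a small deterministic automaton against each point of the text, can be sketched as follows. The automaton below is a hand-written approximation of the microwave oven pattern, with numbered states and labelled edges; it is not a transcription of Fig. 3.25, and the names TRANSITIONS, FINAL and locate are hypothetical.

```python
# Deterministic automaton: (state, token) -> next state.
TRANSITIONS = {
    (0, "microwave"): 1,
    (1, "oven"): 2,
    (1, "ovens"): 2,
    (0, "mwo"): 2,
}
FINAL = {2}


def locate(tokens, transitions=TRANSITIONS, final=FINAL):
    """Return (start, end) token spans of the longest match at each position."""
    matches = []
    for start in range(len(tokens)):
        state, end = 0, None
        for pos in range(start, len(tokens)):
            state = transitions.get((state, tokens[pos].lower()))
            if state is None:
                break  # no transition: stop scanning from this start point
            if state in final:
                end = pos + 1  # remember the longest match seen so far
        if end is not None:
            matches.append((start, end))
    return matches


tokens = "The MWO replaced two microwave ovens".split()
spans = locate(tokens)
```

This restarts the automaton at every token, which is acceptable for small patterns but is not the strategy one would choose for large grammars.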

A local grammar can be a representation of a subject of interest for a user 
in a text, for example one or several particular types of microwave ovens. In 
such a case, the local grammar can be used for text filtering, indexing and 
classification. Weights can be assigned to transitions in order to indicate the 
relevancy of paths with respect to the user’s interest. 

Comprehensive descriptions accounting for general language can reach 
impressive sizes. A complete grammar of dates, including informal dates, e.g. 
before Christmas, recognizes thousands of sequences. To be readable, such a 
description is necessarily organized into several automata. From the formal 
point of view, the principle of such an organization is simple: a general finite 
automaton invokes sub-automata by special labels. Sub-automata, in turn, can 
equally invoke other sub-automata. Recursiveness may be allowed or not. In 
Fig. 3.27, the general automaton for numbers from 1 to 999 written in letters 
invokes the automaton for numbers from 1 to 99. The label for the second 
automaton is shown in grey. The use of labels for automata facilitates linguistic 
description for another reason: the same automaton can be invoked from several 
points and thus shared. Invoking an automaton via a label is thus equivalent 
to substituting it for the label. With patterns like terms, dates or numbers, 
invocations usually do not make up cycles: actual substitution is theoretically 
possible; it makes the set of automata equivalent to one finite automaton. 
However, with large grammars, actual substitution can lead to an explosion in size. 
For example, M. Gross’s grammar of dates in French, which is organized into 
about 100 automata, becomes a 50-Mb automaton if sub-automata are 
systematically substituted. In the case of large grammars, the algorithms for locating 
occurrences in texts efficiently are therefore different: sub-automata are kept 
distinct and the matching algorithm is nondeterministic. 


Figure 3.27. An automaton invokes another. 


If cycles of invocations are allowed, the language recognized by the set of 
automata can be defined by reference to an equivalent context-free grammar 
(cf. section 1.6). The labels invoking sub-automata are the counterparts of 
variables, including the label of the general automaton which corresponds to 
the axiom of the grammar. Each of the automata is translated into a finite 
number of productions of the grammar. Such a set of automata is called a 
“recursive transition network” (RTN). 
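
The invocation of sub-automata by labels can be sketched with a small set of automata for numbers written in letters, matched nondeterministically while sub-automata are kept distinct. Everything here is an assumption made for illustration: the encoding (labels starting with ":" invoke another automaton), the names GRAMMAR, match and recognizes, and the simplified number grammar itself, which is not a transcription of Fig. 3.27.

```python
UNITS = "one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen sixteen "
         "seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Each automaton: initial state, final states, edges state -> [(label, next)].
GRAMMAR = {
    "numbers1to999": {
        "initial": 0, "final": {1, 3},
        "edges": {0: [(":numbers1to99", 1), (":units", 2)],
                  2: [("hundred", 3)],
                  3: [(":numbers1to99", 1)]},
    },
    "numbers1to99": {
        "initial": 0, "final": {1, 2},
        "edges": {0: [(":units", 1)] + [(w, 1) for w in TEENS]
                     + [(w, 2) for w in TENS],
                  2: [(":units", 1)]},
    },
    "units": {
        "initial": 0, "final": {1},
        "edges": {0: [(w, 1) for w in UNITS]},
    },
}


def match(name, tokens, pos, grammar=GRAMMAR):
    """Yield each position reachable after reading a path of automaton `name`."""
    net = grammar[name]
    stack, seen = [(net["initial"], pos)], set()
    while stack:
        state, p = stack.pop()
        if (state, p) in seen:
            continue
        seen.add((state, p))
        if state in net["final"]:
            yield p
        for label, nxt in net["edges"].get(state, []):
            if label.startswith(":"):
                # Invocation: run the sub-automaton, then resume in state nxt.
                for q in match(label[1:], tokens, p, grammar):
                    stack.append((nxt, q))
            elif p < len(tokens) and tokens[p] == label:
                stack.append((nxt, p + 1))


def recognizes(text: str) -> bool:
    tokens = text.split()
    return any(q == len(tokens) for q in match("numbers1to999", tokens, 0))
```

In this example the invocations make up no cycle, so the set of automata is equivalent to one finite automaton. The same matcher accepts cyclic invocations, but as written it would loop forever on a left-recursive invocation, which a real RTN parser has to guard against.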


3.2.3. Parsing 


If we consider more and more complex local grammars, we reach a point where 
the identification of a linguistic form depends on the identification of free con- 
stituents. Free constituents are syntactic constructs, like sentences or noun 
phrases, which involve open categories, like verbs or nouns, in their content. For 
example, recognizing the phrase take into account may imply identifying: 

• its subject, which cannot be any noun, e.g. not air, and 
• its free complement, which can occur before or after into account. 

Both are free constituents. The subject is a noun phrase, which comprises at 
least an open category, a noun. The free complement can be a noun phrase or a 
sentential clause: Max took into account that Mary was early. The identification 
of these free and frozen constituents is required for complex applications like 
translation. 

Several features of RTNs make them adequate for the formal description of 
such phrases. 

• Free constituents can be represented by labels invoking other parts 
  of the grammar. In the example of take into account, these labels will 
  represent types of noun phrases, of sentences and of sentential clauses. 
  Obviously, the labels are reusable from other points of the grammar, 
  because other phrases or verbs will accept the same types of subjects or of 
  complements. 
• Small lexical variations and alternative constructions are described in 
  parallel paths of the automata, as in Fig. 3.28. 
• Recursiveness can be used for embeddings between syntactic constructs. 
  In the example of Fig. 3.28, the phrase and the free constituents around 
  it make up a sentence; the label S included in the automaton represents 
  sentences. Thus, the rule is recursive. 

A large variety of syntactic constructions in natural languages can be 
expressed in that way. A complete description of take into account, for example, 
should include passive, interrogative forms etc., and would be much larger than 
the sample of Fig. 3.28. In addition, the number of grammatical constructions in a language 
is in some way multiplied by the size of the lexicon, since different words do not 
enter into the same grammatical constructions. However, the construction of 
large grammars for thousands of phrases and verbs can be partially automated. 
General grammars are manually constructed in the form of parameterized RTNs, 
then they are adapted to specific lexical items like take into account by setting 
the values of the parameters. These values are encoded for each lexical item 
in tables of syntactic properties. A large proportion of the parameters must be 
at the level of specific lexical items, and not of classes of items (e.g. transitive 
verbs), because syntactic properties are incredibly dependent on actual lexical 
items. 


Figure 3.28. A sample of a grammar of take into account. 


Here are two examples of open problems in the construction of grammars:⁶ 
selectional constraints between predicates (i.e. verbs, nouns and adjectives) and 
their arguments (i.e. subject and essential complements): 


(Max + *The air) took into account that Mary was early 


and selectional constraints between predicates and adverbs: 


Max took the delay into account (last time + *by plane) 


Present grammars either overgenerate or undergenerate when such constraints 
come into play. 

Even so, the construction of grammars of natural languages in the form of 
RTNs now appears to be within reach. 

This situation provides partial answers to a classical controversy about the 
two most popular formal models of syntax: finite automata and context-free 
grammars. The issue of the adequacy of these two models dates back to the 
time of their actual definition and is still going on. Infrequent constructions have 
been used to argue that both were inadequate, but they can be conveniently 
dealt with as exceptions. From 1960 to 1990, the folklore of the domain held 
that it was reasonable practice to use context-free grammars, and a heresy to 
use automata. Since then, investigation results have suggested that the RTN model, 
which is equivalent to context-free grammars but relies heavily on the automaton form, is 
convenient for the manual description of syntax as well as for automatic 
parsing. It is an open question whether the non-recursive counterpart of RTNs, 
which is equivalent to finite automata, will be better. Recursiveness can surely 
be eliminated from RTNs through an automatic compilation process, by 
substituting cycles for terminal embeddings and by limiting central embeddings to 
a fixed maximal depth. But even without recursiveness, RTN-based parsing is 
not necessarily more similar to automaton-based parsing than context-free 
parsing... In any case, the issue now appears less theoretical than computational. 


⁶ In these two examples, the star * marks that a sequence is not acceptable as a sentence. 


3.2.4. Lexical ambiguity reduction 


We mentioned lexical tagging in section 3.1.4. This operation consists of as- 
signing tags to words. Word tags record linguistic information. Lexical tagging 
is not an application in itself, since word tags contain encoded information not 
directly exploitable by users. However, lexical tagging is required for enhancing 
the results of nearly all operations on texts: translation, spelling correction, lo- 
cation of index terms etc. Section 3.1.4 shows how dictionary lookup contributes 
to lexical tagging, but many words should be assigned distinct tags in relation 
to context, like record, a noun or a verb. Such forms are said to be lexically 
ambiguous. Syntactic parsing often resolves all lexical ambiguity. Sentences like 
the following are rare: 


The newspapers found out some record 


This ambiguous sentence has two syntactic analyses: some record is a noun 
phrase or a sentential clause, and record is accordingly a noun or a verb. 

Syntactic parsing is not a mature technique yet, and there is a need for 
procedures that can work without complete syntactic grammars of languages, 
even if they resolve less lexical ambiguity than syntactic parsing. 

Such a procedure can be designed on the following basis. After dictionary 
lookup, a text can be represented as an acyclic automaton of analyses like that 
of Fig. 3.29. Syntactic constraints can be represented as an automaton over the 
same alphabet. Fig. 3.30 states that when the word good is a noun, it cannot 
follow the indefinite determiner a. The label @ stands for a default symbol: 
it matches the next input symbol if, at this point of the automaton, no other 
symbol matches. The intersection of the two automata is shown in Fig. 3.31; it 
represents those analyses of the text that obey the constraints. The intersection 
of two automata is an automaton that recognizes the intersection of the two 
languages recognized. It is constructed by a simple algorithm. Different 
syntactic constraints can be represented by different automata: since intersection 
is associative and commutative, the automata can be intersected in any order 
without changing the result. Thus, various syntactic constraints can be 
formalized independently and accumulated in order to reduce progressively more 
lexical ambiguity. However, this approach needs a convenient interface to allow 
linguists to express the constraints in the form of automata. Automata like that 
of Fig. 3.30 can be directly constructed only in very simple cases. 


Figure 3.29. The automaton of analyses of though a good deal soiled. 
[Each transition is labelled with a word form and a lexical tag, e.g. 
though{though.CONJ}, good{good.A}, good{good.N:s}, deal{deal.N:s}, soiled{soiled.A}.] 


Figure 3.30. An automaton stating a syntactic constraint. 


Figure 3.31. The intersection of the two automata. 
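
The intersection of an automaton of analyses with a constraint automaton can be sketched by the usual product construction. The sketch rests on several assumptions: the text automaton is a simplified fragment for a good deal rather than the full Fig. 3.29; the constraint automaton is one possible encoding of the constraint of Fig. 3.30, in which the forbidden symbol leads to a non-final dead state; and the names intersect and accepted_paths are hypothetical.

```python
DEFAULT = "@"  # default symbol: used only when no explicit label matches

# Simplified automaton of analyses: (source, label, target), states 0..3.
TEXT = [
    (0, "a{a.DET:s}", 1), (0, "a{a.N:s}", 1),
    (1, "good{good.A}", 2), (1, "good{good.N:s}", 2),
    (2, "deal{deal.N:s}", 3), (2, "deal{deal.V:W}", 3),
]

# Constraint: the noun "good" may not follow the determiner "a".
CONSTRAINT = {
    "start": {"a{a.DET:s}": "after_a", DEFAULT: "start"},
    "after_a": {"a{a.DET:s}": "after_a", "good{good.N:s}": "sink",
                DEFAULT: "start"},
    "sink": {},  # non-final dead state
}
C_FINAL = {"start", "after_a"}


def intersect(text_edges, t_start, t_final, constraint, c_start, c_final):
    """Product construction; product states are (text state, constraint state)."""
    start = (t_start, c_start)
    result, todo, seen = [], [start], {start}
    while todo:
        t, c = todo.pop()
        moves = constraint.get(c, {})
        for src, label, dst in text_edges:
            if src != t:
                continue
            # An explicit label has priority over the default symbol.
            c_next = moves.get(label, moves.get(DEFAULT))
            if c_next is None:
                continue
            nxt = (dst, c_next)
            result.append(((t, c), label, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    finals = {s for s in seen if s[0] == t_final and s[1] in c_final}
    return result, start, finals


def accepted_paths(edges, state, finals):
    """Enumerate label sequences of successful paths (acyclic automata only)."""
    if state in finals:
        yield []
    for src, label, dst in edges:
        if src == state:
            for rest in accepted_paths(edges, dst, finals):
                yield [label] + rest


edges, start, finals = intersect(TEXT, 0, 3, CONSTRAINT, "start", C_FINAL)
analyses = list(accepted_paths(edges, start, finals))
```

Of the eight analyses of the fragment, the two in which the noun good follows the determiner a are eliminated. Further constraints could be applied by intersecting the result again, in any order.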

An alternative approach combines dictionary lookup and ambiguity resolu- 
tion in another way. It considers that the relevant data are (i) the probability 
for a given word to occur with a given tag, and (ii) the probability of occurrence 
of a sequence of words (or tags). Such probabilities are estimated on the basis of 
statistics in a tagged corpus. The resulting values are inserted into a weighted 
automaton to make up a model of language. This technique has been applied 
to small tag sets, and the possibility of tagging compound words has not been 
seriously investigated. 
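
The probabilistic alternative can be sketched as a search for the tag sequence that maximizes the product of the two kinds of probabilities, restricted here to contexts of one tag to the left. The probabilities below are invented for the example rather than estimated from a tagged corpus, and the names viterbi, P_WORD_TAG and P_TAG_SEQ are hypothetical.

```python
def viterbi(words, tags, p_word_tag, p_tag_seq, start="<s>"):
    """Return (score, tag sequence) maximizing the product of the weights.

    p_word_tag[(word, tag)]: weight of the word occurring with the tag.
    p_tag_seq[(previous tag, tag)]: weight of the tag after the previous tag.
    Assumes every word has at least one tag with non-zero weight.
    """
    best = {start: (1.0, [])}  # best[tag] = best scored path ending in tag
    for w in words:
        new = {}
        for t in tags:
            emit = p_word_tag.get((w, t), 0.0)
            if emit == 0.0:
                continue
            cands = [(score * p_tag_seq.get((prev, t), 0.0) * emit, path + [t])
                     for prev, (score, path) in best.items()]
            new[t] = max(cands, key=lambda c: c[0])
        best = new
    return max(best.values(), key=lambda c: c[0])


TAGS = ["DET", "PRO", "N", "V"]
P_WORD_TAG = {("the", "DET"): 1.0, ("they", "PRO"): 1.0,
              ("record", "N"): 0.6, ("record", "V"): 0.4}
P_TAG_SEQ = {("<s>", "DET"): 0.5, ("<s>", "PRO"): 0.5,
             ("DET", "N"): 0.9, ("DET", "V"): 0.1,
             ("PRO", "N"): 0.2, ("PRO", "V"): 0.8}

_, tags_the = viterbi(["the", "record"], TAGS, P_WORD_TAG, P_TAG_SEQ)
_, tags_they = viterbi(["they", "record"], TAGS, P_WORD_TAG, P_TAG_SEQ)
```

With these toy values, record is tagged as a noun after the and as a verb after they, although its most likely tag in isolation is the noun.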


Notes 


The notion of formal model in linguistics emerged progressively. We will men- 
tion a few milestones on this path. During the first half of the twentieth century, 
Saussure stated clearly that language is a system and that form/meaning as- 
sociations are arbitrary. This was a first step towards the separation between 
syntax and semantics. The translation of this idea into practice owes much to 
the study of native American languages by Sapir 1921. During the second half 
of the century, Harris incorporated the information aspect into the study of the 
forms of language. In particular, he introduced the notion of transformation 
(Harris 1952, Harris 1970). Gross 1975, Gross 1979 originated the construction 
of tables of syntactic properties. The parameterized graphs of section 3.2.3 are 
used in Senellart 1998 and Paumier 2001. 

The theory of formal languages developed in parallel (Schützenberger and 
Chomsky 1963; Gross and Lentin 1967). Discussions arose during the same 
period of time about the adequacy of formal models for representing the behav- 
ior of speakers (Miller and Chomsky 1963) or the syntax of natural languages. 
Chomsky 1956, Chomsky 1957 mathematically “proved” that neither finite au- 
tomata nor context-free languages were adequate for syntax, but he used infre- 
quent constructions that can be conveniently dealt with as exceptions (Gross 
1995). Gross gave an impulse to the actual production of extensive descriptions 
of lexicon and syntax with finite automata. 

The observations that led to the statement of Zipf’s law (Zipf 1935) were 
not restricted to language. The results exposed in section 3.1.3 about Zipf’s law 
applied to written texts are based on Senellart 1999. 

Johnson 1972 investigated various ways of combining formal rules and 
established whether the result of combination can be represented as a finite 
automaton. The notion of sequential transducer originates from Schützenberger 1977. 
Two algorithms of minimization of sequential transducers are known (Breslauer 
1998; Béal and Carton 2001); the second one is based on successive contribu- 
tions by Choffrut 1979, Reutenauer 1990 and Mohri 1994 (see also Chapter 1). 
The definition of p-sequential transducers was proposed by Mohri 1994. The 
algorithm of construction of generalized sequential transducers is adapted from 
Roche 1997. 

The representation of finite automata as graphs with labels attached to 
states was introduced into language processing by Gross 1989 and Silberztein 
1994 (http://acl.ldc.upenn.edu/C/C94/C94-1095.pdf). The Unitex system 
(http://www-igm.univ-mlv.fr/~unitex), implemented by Sébastien Paumier 
at the University of Marne-la-Vallée, is an open-source environment for language 
processing with automata and dictionaries. 

The use of the intersection of finite transducers for specifying and 
implementing morphological analysis and generation, and for lexical ambiguity 
resolution, was first suggested by Koskenniemi 1983. Bimachines were introduced 
by Schützenberger 1961. The adaptation of bimachines to morphology and 
phonetics comes from Laporte 1997. 

Weighted automata and transducers are defined by Paz 1971 and Eilenberg 
1974. The FSM library (Mohri, Pereira, and Riley 2000) offers consistent tools 
related to weighted automata. 

Algorithms for deriving weights from statistics about occurrences of symbols 
or sequences in a learning corpus are available in handbooks, e.g. Jurafsky and 
Martin 2000. 


4.0. Introduction 


The application of statistical methods to natural language processing has been 
remarkably successful over the past two decades. The wide availability of text 
and speech corpora has played a critical role in their success since, as for all 
learning techniques, these methods heavily rely on data. Many of the compo- 
nents of complex natural language processing systems, e.g., text normalizers, 
morphological or phonological analyzers, part-of-speech taggers, grammars or 
language models, pronunciation models, context-dependency models, acoustic 
Hidden-Markov Models (HMMs), are statistical models derived from large data 
sets using modern learning techniques. These models are often given as weighted 
automata or weighted finite-state transducers either directly or as a result of the 
approximation of more complex models. 

Weighted automata and transducers are the finite automata and finite-state 
transducers described in Chapter 1 Section 1.5 with the addition of some weight 
to each transition. Thus, weighted finite-state transducers are automata in 
which each transition, in addition to its usual input label, is augmented with 
an output label from a possibly different alphabet, and carries some weight. The 
weights may correspond to probabilities or log-likelihoods or they may be some 
other costs used to rank alternatives. More generally, as we shall see in the next 
section, they are elements of a semiring set. Transducers can be used to define 
a mapping between two different types of information sources, e.g., word and 
phoneme sequences. The weights are crucial to model the uncertainty of such 
mappings. Weighted transducers can be used for example to assign different 
pronunciations to the same word but with different ranks or probabilities. 


SEMIRING      SET               ⊕       ⊗     0̄      1̄ 
Boolean       {0, 1}            ∨       ∧     0      1 
Probability   ℝ₊                +       ×     0      1 
Log           ℝ ∪ {−∞, +∞}      ⊕log    +     +∞     0 
Tropical      ℝ ∪ {−∞, +∞}      min     +     +∞     0 

Table 4.1. Semiring examples. ⊕log is defined by: x ⊕log y = −log(e^(−x) + e^(−y)). 
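
The semirings of Table 4.1 can be made concrete with a short sketch. The class Semiring and the helpers log_plus, path_weight and total_weight are hypothetical names introduced for illustration; the weight of a path is taken to be the ⊗-product of its transition weights, and alternative paths are combined with ⊕.

```python
import math
import operator
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # ⊕, combines alternative paths
    times: Callable[[Any, Any], Any]  # ⊗, combines weights along a path
    zero: Any                         # identity of ⊕, annihilator of ⊗
    one: Any                          # identity of ⊗


def log_plus(x: float, y: float) -> float:
    """x ⊕log y = -log(e^-x + e^-y), for finite or +inf arguments."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))


BOOLEAN = Semiring(operator.or_, operator.and_, False, True)
PROBABILITY = Semiring(operator.add, operator.mul, 0.0, 1.0)
LOG = Semiring(log_plus, operator.add, math.inf, 0.0)
TROPICAL = Semiring(min, operator.add, math.inf, 0.0)


def path_weight(k: Semiring, weights):
    return reduce(k.times, weights, k.one)


def total_weight(k: Semiring, paths):
    return reduce(k.plus, (path_weight(k, p) for p in paths), k.zero)


# Two alternative paths with invented probabilities, and their negative logs.
PATHS = [[0.5, 0.4], [0.5, 0.2]]
NEG_LOG_PATHS = [[-math.log(w) for w in p] for p in PATHS]

p_total = total_weight(PROBABILITY, PATHS)
log_total = total_weight(LOG, NEG_LOG_PATHS)
tropical_total = total_weight(TROPICAL, NEG_LOG_PATHS)
```

On this example the log semiring gives the negative logarithm of the probability-semiring result, while the tropical semiring keeps only the weight of the best path.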

Novel algorithms are needed to combine and optimize large statistical models 
represented as weighted automata or transducers. This chapter reviews several 
recent weighted transducer algorithms, including composition of weighted trans- 
ducers, determinization of weighted automata and minimization of weighted 
automata, which play a crucial role in the construction of modern statistical 
natural language processing systems. It also outlines their use in the design 
of modern real-time speech recognition systems. It discusses and illustrates 
the representation by weighted automata and transducers of the components of 
these systems, and describes the use of these algorithms for combining, search- 
ing, and optimizing large component transducers of several million transitions 
for creating real-time speech recognition systems. 


4.1. Preliminaries 


This section introduces the definitions and notation used in the following. 

A system $(K, \oplus, \otimes, \bar{0}, \bar{1})$ is a semiring if $(K, \oplus, \bar{0})$ is a
commutative monoid with identity element $\bar{0}$, $(K, \otimes, \bar{1})$ is a monoid with
identity element $\bar{1}$, $\otimes$ distributes over $\oplus$, and $\bar{0}$ is an annihilator
for $\otimes$: for all $a \in K$, $a \otimes \bar{0} = \bar{0} \otimes a = \bar{0}$.
Thus, a semiring is a ring that may lack negation. Table 4.1 lists some familiar
semirings. In addition to the Boolean semiring, and the probability semiring
used to combine probabilities, two semirings often used in text and speech
processing applications are the log semiring, which is isomorphic to the probability
semiring via the negative-log morphism, and the tropical semiring, which is
derived from the log semiring using the Viterbi approximation. A left semiring is
a system that verifies all the axioms of a semiring except right distributivity.
In the following definitions, $K$ will be used to denote a left semiring or a
semiring.
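The four semirings of Table 4.1 can be written down concretely. The following is a minimal Python sketch, not part of the original text: the `Semiring` container and the constant names are illustrative assumptions, and weights are represented by Python floats (or booleans for the Boolean semiring).

```python
import math
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """Hypothetical container for (plus, times, zero, one)."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

def log_plus(x: float, y: float) -> float:
    """x (+)_log y = -log(e^{-x} + e^{-y}), computed in a numerically stable way."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))

BOOLEAN = Semiring(lambda x, y: x or y, lambda x, y: x and y, False, True)
PROBABILITY = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)
LOG = Semiring(log_plus, lambda x, y: x + y, math.inf, 0.0)
TROPICAL = Semiring(min, lambda x, y: x + y, math.inf, 0.0)

# The negative-log morphism maps probability addition to log-semiring addition.
p, q = 0.2, 0.3
lhs = LOG.plus(-math.log(p), -math.log(q))
rhs = -math.log(PROBABILITY.plus(p, q))
```

Replacing `log_plus` by `min` is the Viterbi approximation mentioned above: only the best of the two alternatives is kept.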

A semiring is said to be commutative when the multiplicative operation $\otimes$
is commutative. It is said to be left divisible if for any $x \neq \bar{0}$, there exists
$y \in K$ such that $y \otimes x = \bar{1}$, that is if all nonzero elements of $K$ admit a left inverse.
$(K, \oplus, \otimes, \bar{0}, \bar{1})$ is said to be weakly left divisible if for any $x$ and $y$ in $K$ such that
$x \oplus y \neq \bar{0}$, there exists at least one $z$ such that $x = (x \oplus y) \otimes z$. The $\otimes$-operation
is cancellative if $z$ is unique and we can write: $z = (x \oplus y)^{-1} x$. When $z$ is not
unique, we can still assume that we have an algorithm to find one of the possible
$z$ and call it $(x \oplus y)^{-1} x$. Furthermore, we will assume that $z$ can be found in
a consistent way, that is: $((u \otimes x) \oplus (u \otimes y))^{-1} (u \otimes x) = (x \oplus y)^{-1} x$ for any
$x, y, u \in K$ such that $u \neq \bar{0}$. A semiring is zero-sum-free if for any $x$ and $y$ in $K$,
$x \oplus y = \bar{0}$ implies $x = y = \bar{0}$.
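As an illustration of weak left divisibility, the element $z = (x \oplus y)^{-1} x$ can be computed explicitly in the tropical and probability semirings. The sketch below is an assumption-laden example rather than part of the original text; the function names are hypothetical, and both arguments are assumed to be different from $\bar{0}$.

```python
def tropical_residual(x, y):
    """z with x = min(x, y) + z, i.e. (x (+) y)^{-1} x in the tropical semiring."""
    return x - min(x, y)

def probability_residual(x, y):
    """z with x = (x + y) * z, i.e. (x (+) y)^{-1} x in the probability semiring."""
    return x / (x + y)

# Consistency: multiplying both x and y by the same u does not change z.
u, x, y = 2, 3, 5
same = tropical_residual(u + x, u + y) == tropical_residual(x, y)
```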

A weighted finite-state transducer $T$ over a semiring $K$ is an 8-tuple $T =
(A, B, Q, I, F, E, \lambda, \rho)$ where: $A$ is the finite input alphabet of the transducer; $B$
is the finite output alphabet; $Q$ is a finite set of states; $I \subseteq Q$ the set of initial
states; $F \subseteq Q$ the set of final states; $E \subseteq Q \times (A \cup \{\epsilon\}) \times (B \cup \{\epsilon\}) \times K \times Q$ a finite
set of transitions; $\lambda : I \to K$ the initial weight function; and $\rho : F \to K$ the final
weight function mapping $F$ to $K$. $E[q]$ denotes the set of transitions leaving a
state $q \in Q$. $|T|$ denotes the sum of the number of states and transitions of $T$.

Weighted automata are defined in a similar way by simply omitting the input 
or output labels. Let $\Pi_1(T)$ ($\Pi_2(T)$) denote the weighted automaton obtained
from a weighted transducer $T$ by omitting the input (resp. output) labels of $T$.

Given a transition $e \in E$, let $p[e]$ denote its origin or previous state, $n[e]$
its destination state or next state, $i[e]$ its input label, $o[e]$ its output label,
and $w[e]$ its weight. A path $\pi = e_1 \cdots e_k$ is an element of $E^*$ with consecutive
transitions: $n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. $n$, $p$, and $w$ can be extended to
paths by setting: $n[\pi] = n[e_k]$ and $p[\pi] = p[e_1]$ and by defining the weight of
a path as the $\otimes$-product of the weights of its constituent transitions: $w[\pi] =
w[e_1] \otimes \cdots \otimes w[e_k]$. More generally, $w$ is extended to any finite set of paths $R$
by setting: $w[R] = \bigoplus_{\pi \in R} w[\pi]$. Let $P(q, q')$ denote the set of paths from $q$ to
$q'$ and $P(q, x, y, q')$ the set of paths from $q$ to $q'$ with input label $x \in A^*$ and
output label $y \in B^*$. These definitions can be extended to subsets $R, R' \subseteq Q$,
by: $P(R, x, y, R') = \bigcup_{q \in R,\, q' \in R'} P(q, x, y, q')$. A transducer $T$ is regulated if the
weight associated by $T$ to any pair of input-output strings $(x, y)$ given by:


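    $[\![T]\!](x, y) = \bigoplus_{\pi \in P(I, x, y, F)} \lambda[p[\pi]] \otimes w[\pi] \otimes \rho[n[\pi]]$    (4.1.1)

is well-defined and in $K$. $[\![T]\!](x, y) = \bar{0}$ when $P(I, x, y, F) = \emptyset$. In particular,
when it does not have any $\epsilon$-cycle, $T$ is always regulated.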
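Formula (4.1.1) can be evaluated without enumerating paths one by one. The following Python sketch is an illustration under stated assumptions, not the chapter's own procedure: the transducer is assumed to be $\epsilon$-free (so every transition reads one input and one output symbol), the semiring is passed as the hypothetical parameters `plus`, `times` and `zero`, and the data layout (dictionaries and 5-tuples) is chosen only for this example. It relies on the distributivity of $\otimes$ over $\oplus$ to accumulate partial sums state by state.

```python
def transducer_weight(x, y, initial, final, transitions, plus, times, zero):
    """Weight associated with the pair (x, y) by an epsilon-free weighted transducer.

    initial, final: dict state -> weight (lambda and rho).
    transitions: list of (source, input symbol, output symbol, weight, destination).
    """
    if len(x) != len(y):            # epsilon-free: input and output lengths agree
        return zero
    # forward[q]: (+)-sum of lambda (x) w over paths from I to q reading the prefix so far
    forward = dict(initial)
    for a, b in zip(x, y):
        nxt = {}
        for src, i_sym, o_sym, w, dst in transitions:
            if src in forward and i_sym == a and o_sym == b:
                contribution = times(forward[src], w)
                nxt[dst] = plus(nxt[dst], contribution) if dst in nxt else contribution
        forward = nxt
    total = zero
    for q, v in forward.items():
        if q in final:
            total = plus(total, times(v, final[q]))
    return total
```

With the probability semiring (`plus` is addition, `times` is multiplication, `zero` is 0.0), two parallel transitions labeled a:b with weights 0.3 and 0.2 leading to a final state of weight 0.5 give the pair (a, b) the weight (0.3 + 0.2) x 0.5.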


4.2. Algorithms 


4.2.1. Composition 


Composition is a fundamental algorithm used to create complex weighted trans- 
ducers from simpler ones. It is a generalization of the composition algorithm 
presented in Chapter 1 Section 1.5 for unweighted finite-state transducers. Let
$K$ be a commutative semiring and let $T_1$ and $T_2$ be two weighted transducers
defined over $K$ such that the input alphabet of $T_2$ coincides with the output
alphabet of $T_1$. Assume that the infinite sum $\bigoplus_z T_1(x, z) \otimes T_2(z, y)$ is well-defined
and in $K$ for all $(x, y) \in A^* \times C^*$. This condition holds for all transducers defined
over a closed semiring such as the Boolean semiring and the tropical semiring
and for all acyclic transducers defined over an arbitrary semiring. Then, the
result of the composition of $T_1$ and $T_2$ is a weighted transducer denoted by
$T_1 \circ T_2$ and defined for all $x, y$ by:

    $[\![T_1 \circ T_2]\!](x, y) = \bigoplus_z T_1(x, z) \otimes T_2(z, y)$    (4.2.1)


Note that we use a matrix notation for the definition of composition as opposed 
to a functional notation. There exists a general and efficient composition
algorithm for weighted transducers. States in the composition $T_1 \circ T_2$ of two
weighted transducers $T_1$ and $T_2$ are identified with pairs of a state of $T_1$ and
a state of $T_2$. Leaving aside transitions with $\epsilon$ inputs or outputs, the following
rule specifies how to compute a transition of $T_1 \circ T_2$ from appropriate transitions
of $T_1$ and $T_2$:

    $(q_1, a, b, w_1, q_2)$ and $(q_1', b, c, w_2, q_2') \Longrightarrow ((q_1, q_1'), a, c, w_1 \otimes w_2, (q_2, q_2'))$    (4.2.2)


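The following is the pseudocode of the algorithm in the $\epsilon$-free case.


WEIGHTED-COMPOSITION($T_1$, $T_2$)

 1  $Q \leftarrow I_1 \times I_2$
 2  $S \leftarrow I_1 \times I_2$
 3  while $S \neq \emptyset$ do
 4      $(q_1, q_2) \leftarrow$ HEAD($S$)
 5      DEQUEUE($S$)
 6      if $(q_1, q_2) \in I_1 \times I_2$ then
 7          $I \leftarrow I \cup \{(q_1, q_2)\}$
 8          $\lambda(q_1, q_2) \leftarrow \lambda_1(q_1) \otimes \lambda_2(q_2)$
 9      if $(q_1, q_2) \in F_1 \times F_2$ then
10          $F \leftarrow F \cup \{(q_1, q_2)\}$
11          $\rho(q_1, q_2) \leftarrow \rho_1(q_1) \otimes \rho_2(q_2)$
12      for each $(e_1, e_2) \in E[q_1] \times E[q_2]$ such that $o[e_1] = i[e_2]$ do
13          if $(n[e_1], n[e_2]) \notin Q$ then
14              $Q \leftarrow Q \cup \{(n[e_1], n[e_2])\}$
15              ENQUEUE($S$, $(n[e_1], n[e_2])$)
16          $E \leftarrow E \cup \{((q_1, q_2), i[e_1], o[e_2], w[e_1] \otimes w[e_2], (n[e_1], n[e_2]))\}$
17  return $T$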


The algorithm takes as input $T_1 = (A, B, Q_1, I_1, F_1, E_1, \lambda_1, \rho_1)$ and $T_2 =
(B, C, Q_2, I_2, F_2, E_2, \lambda_2, \rho_2)$, two weighted transducers, and outputs a weighted
transducer $T = (A, C, Q, I, F, E, \lambda, \rho)$ realizing the composition of $T_1$ and $T_2$.
$E$, $I$, and $F$ are all assumed to be initialized to the empty set.

Figure 4.1. (a) Weighted transducer $T_1$ over the probability semiring.
(b) Weighted transducer $T_2$ over the probability semiring. (c) Composition
of $T_1$ and $T_2$. Initial states are represented by an incoming arrow,
final states with an outgoing arrow. Inside each circle, the first number
indicates the state number, the second, at final states only, the value of
the final weight function $\rho$ at that state. Arrows represent transitions and
are labeled with symbols followed by their corresponding weight.

The algorithm uses a queue S containing the set of pairs of states yet to 
be examined. The queue discipline of S can be arbitrarily chosen and does 
not affect the termination of the algorithm. The set of states $Q$ is originally
reduced to the set of pairs of the initial states of the original transducers and $S$
is initialized to the same (lines 1-2). Each time through the loop of lines 3-16, a
new pair of states $(q_1, q_2)$ is extracted from $S$ (lines 4-5). The initial weight of
$(q_1, q_2)$ is computed by $\otimes$-multiplying the initial weights of $q_1$ and $q_2$ when they
are both initial states (lines 6-8). Similar steps are followed for final states (lines
9-11). Then, for each pair of matching transitions $(e_1, e_2)$, a new transition is
created according to the rules specified earlier (line 16). If the destination state
$(n[e_1], n[e_2])$ has not been found before, it is added to $Q$ and inserted in $S$ (lines
14-15).
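The $\epsilon$-free composition just described can be sketched in Python as follows. This is an illustrative transcription under assumptions that are not in the original text: a transducer is represented as a dictionary with keys `"initial"`, `"final"` (state to weight) and `"transitions"` (5-tuples), the $\otimes$-operation is passed as the parameter `times`, and a FIFO queue is used although any queue discipline would do.

```python
from collections import deque

def compose(t1, t2, times):
    """Epsilon-free composition of two weighted transducers.

    Transitions are tuples (source, input, output, weight, destination).
    """
    leaving1, leaving2 = {}, {}
    for e in t1["transitions"]:
        leaving1.setdefault(e[0], []).append(e)
    for e in t2["transitions"]:
        leaving2.setdefault(e[0], []).append(e)

    start = [(q1, q2) for q1 in t1["initial"] for q2 in t2["initial"]]
    states = set(start)                      # Q
    queue = deque(start)                     # S
    initial, final, transitions = {}, {}, []
    while queue:
        q1, q2 = queue.popleft()
        if q1 in t1["initial"] and q2 in t2["initial"]:
            initial[(q1, q2)] = times(t1["initial"][q1], t2["initial"][q2])
        if q1 in t1["final"] and q2 in t2["final"]:
            final[(q1, q2)] = times(t1["final"][q1], t2["final"][q2])
        for _, a, b, w1, n1 in leaving1.get(q1, []):
            for _, b2, c, w2, n2 in leaving2.get(q2, []):
                if b != b2:                  # output of e1 must match input of e2
                    continue
                dst = (n1, n2)
                if dst not in states:
                    states.add(dst)
                    queue.append(dst)
                transitions.append(((q1, q2), a, c, times(w1, w2), dst))
    return {"initial": initial, "final": final, "transitions": transitions}
```

Only pairs reachable from the initial pairs are ever created, which is what makes an on-demand (lazy) variant natural: the loop body can be run only for the pairs that a client actually visits.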

In the worst case, all transitions of $T_1$ leaving a state $q_1$ match all those
of $T_2$ leaving state $q_1'$, thus the space and time complexity of composition is
quadratic: $O(|T_1||T_2|)$. However, a lazy implementation of composition can
be used to construct just the part of the composed transducer that is needed.
Figures 4.1(a)-(c) illustrate the algorithm when applied to the transducers of
Figures 4.1(a)-(b) defined over the probability semiring.

More care is needed when $T_1$ admits output $\epsilon$ labels and $T_2$ input $\epsilon$ labels.
Indeed, as illustrated by Figure 4.2, a straightforward generalization of the
$\epsilon$-free case would generate redundant $\epsilon$-paths and, in the case of non-idempotent
semirings, would lead to an incorrect result. The weight of the matching paths
of the original transducers would be counted $p$ times, where $p$ is the number of
redundant paths in the result of composition.

Figure 4.2. Redundant $\epsilon$-paths. A straightforward generalization of
the $\epsilon$-free case could generate all the paths from (1,1) to (3,2) when
composing the two simple transducers on the right-hand side.

To cope with this problem, all but one $\epsilon$-path must be filtered out of the
composite transducer. Figure 4.2 indicates in boldface one possible choice for that
path, which in this case is the shortest. Remarkably, that filtering mechanism
can be encoded as a finite-state transducer.

Let $\tilde{T}_1$ ($\tilde{T}_2$) be the weighted transducer obtained from $T_1$ (resp. $T_2$) by
replacing the output (resp. input) $\epsilon$ labels with $\epsilon_2$ (resp. $\epsilon_1$), and let $\mathcal{F}$ be the
filter finite-state transducer represented in Figure 4.3. Then $\tilde{T}_1 \circ \mathcal{F} \circ \tilde{T}_2 = T_1 \circ T_2$.
Since the two compositions in $\tilde{T}_1 \circ \mathcal{F} \circ \tilde{T}_2$ do not involve $\epsilon$'s, the $\epsilon$-free composition
already described can be used to compute the resulting transducer.

Intersection (or Hadamard product) of weighted automata and composition 
of finite-state transducers are both special cases of composition of weighted 
transducers. Intersection corresponds to the case where input and output la- 
bels of transitions are identical and composition of unweighted transducers is 
obtained simply by omitting the weights. 

Figure 4.3. Filter for composition $\mathcal{F}$.

In general, the definition of composition cannot be extended to the case of
non-commutative semirings because the composite transduction cannot always
be represented by a weighted finite-state transducer. Consider for example, the
case of two transducers $T_1$ and $T_2$ accepting the same set of strings $(a, a)^*$, with
$[\![T_1]\!](a, a) = x \in K$ and $[\![T_2]\!](a, a) = y \in K$ and let $\tau$ be the composite of the
transductions corresponding to $T_1$ and $T_2$. Then, for any non-negative integer
$n$, $\tau(a^n, a^n) = x^n \otimes y^n$ which in general is different from $(x \otimes y)^n$ if $x$ and $y$
do not commute. An argument similar to the classical Pumping lemma can
then be used to show that $\tau$ cannot be represented by a weighted finite-state
transducer.
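The gap between $x^n \otimes y^n$ and $(x \otimes y)^n$ is easy to observe with a concrete non-commutative multiplication. The sketch below uses string concatenation as $\otimes$; this choice, and the names used, are assumptions made only for illustration.

```python
def times(x: str, y: str) -> str:
    """String concatenation, used here as a non-commutative multiplication."""
    return x + y

def power(x: str, n: int) -> str:
    """n-fold product of x with itself under concatenation."""
    return x * n

x, y, n = "p", "q", 2
composite_weight = times(power(x, n), power(y, n))   # x^n (x) y^n
transition_wise = power(times(x, y), n)              # (x (x) y)^n
```

Matching transitions one at a time can only produce products of the second form, which is why the transition-based algorithm requires commutativity.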

When $T_1$ and $T_2$ are acyclic, composition can be extended to the case of
non-commutative semirings. The algorithm would then consist of matching paths
of $T_1$ and $T_2$ directly rather than matching their constituent transitions. The
termination of the algorithm is guaranteed by the fact that the number of paths
of $T_1$ and $T_2$ is finite. However, the time and space complexity of the algorithm
is then exponential.

The weights of matching transitions and paths are $\otimes$-multiplied in composition.
One might wonder if another useful operation, $\times$, can be used instead of
$\otimes$, in particular when $K$ is not commutative. The following proposition shows
that this is not possible.


PROPOSITION 4.2.1. Let $(K, \times, e)$ be a monoid. Assume that $\times$ is used
instead of $\otimes$ in composition. Then, $\times$ coincides with $\otimes$ and $(K, \oplus, \otimes, \bar{0}, \bar{1})$ is a
commutative semiring.


Proof. Consider two sets of consecutive transitions of two paths: $\pi_1 =
(p_1, a, a, x, q_1)(q_1, b, b, y, r_1)$ and $\pi_2 = (p_2, a, a, u, q_2)(q_2, b, b, v, r_2)$. Matching
these transitions using $\times$ results in the following:

    $((p_1, p_2), a, a, x \times u, (q_1, q_2))$ and $((q_1, q_2), b, b, y \times v, (r_1, r_2))$    (4.2.3)

Since the weight of the path obtained by matching $\pi_1$ and $\pi_2$ must also
correspond to the $\times$-multiplication of the weight of $\pi_1$, $x \otimes y$, and the weight of $\pi_2$,
$u \otimes v$, we have:

    $(x \times u) \otimes (y \times v) = (x \otimes y) \times (u \otimes v)$    (4.2.4)

This identity must hold for all $x, y, u, v \in K$. Setting $u = y = e$ and $v = \bar{1}$ leads
to $x = x \otimes e$ and similarly $x = e \otimes x$ for all $x$. Since the identity element of $\otimes$
is unique, this proves that $e = \bar{1}$.

With $u = y = \bar{1}$, identity 4.2.4 can be rewritten as: $x \otimes v = x \times v$ for all $x$
and $v$, which shows that $\times$ coincides with $\otimes$. Finally, setting $x = v = \bar{1}$ gives
$u \otimes y = y \times u$ for all $y$ and $u$ which shows that $\otimes$ is commutative. $\square$


4.2.2. Determinization 


This section describes a generic determinization algorithm for weighted au- 
tomata. It is thus a generalization of the determinization algorithm for un- 
weighted finite automata. When combined with the (unweighted) determiniza- 
tion for finite-state transducers presented in Chapter 1 Section 1.5, it leads to 
an algorithm for determinizing weighted transducers.¹

A weighted automaton is said to be deterministic or subsequential if it has 
a unique initial state and if no two transitions leaving any state share the same 
input label. There exists a natural extension of the classical subset construc- 
tion to the case of weighted automata over a weakly left divisible left semiring 
called determinization.² The algorithm is generic: it works with any weakly left
divisible semiring. The pseudocode of the algorithm is given below with $Q'$, $I'$,
$F'$, and $E'$ all initialized to the empty set.


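WEIGHTED-DETERMINIZATION($\mathcal{A}$)

 1  $i' \leftarrow \{(i, \lambda(i)) : i \in I\}$
 2  $\lambda'(i') \leftarrow \bar{1}$
 3  $S \leftarrow \{i'\}$
 4  while $S \neq \emptyset$ do
 5      $p' \leftarrow$ HEAD($S$)
 6      DEQUEUE($S$)
 7      for each $x \in i[E[Q[p']]]$ do
 8          $w' \leftarrow \bigoplus \{v \otimes w : (p, v) \in p', (p, x, w, q) \in E\}$
 9          $q' \leftarrow \{(q, \bigoplus \{w'^{-1} \otimes (v \otimes w) : (p, v) \in p', (p, x, w, q) \in E\}) :$
                $q = n[e], i[e] = x, e \in E[Q[p']]\}$
10          $E' \leftarrow E' \cup \{(p', x, w', q')\}$
11          if $q' \notin Q'$ then
12              $Q' \leftarrow Q' \cup \{q'\}$
13              if $Q[q'] \cap F \neq \emptyset$ then
14                  $F' \leftarrow F' \cup \{q'\}$
15                  $\rho'(q') \leftarrow \bigoplus \{v \otimes \rho(q) : (q, v) \in q', q \in F\}$
16              ENQUEUE($S$, $q'$)
17  return $\mathcal{A}'$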


¹ In reality, the determinization of unweighted and that of weighted finite-state transducers
can both be viewed as special instances of the generic algorithm presented here but, for clarity
purposes, we will not emphasize that view in what follows.

² We assume that the weighted automata considered are all such that for any string $x \in A^*$,
$w[P(I, x, Q)] \neq \bar{0}$. This condition is always satisfied with trim machines over the tropical
semiring or any zero-sum-free semiring.

A weighted subset $p'$ of $Q$ is a set of pairs $(q, x) \in Q \times K$. $Q[p']$ denotes the
set of states $q$ of the weighted subset $p'$. $E[Q[p']]$ represents the set of transitions
leaving these states, and $i[E[Q[p']]]$ the set of input labels of these transitions.

The states of the output automaton can be identified with (weighted) subsets 
of the states of the original automaton. A state $r$ of the output automaton
that can be reached from the start state by a path $\pi$ is identified with the
set of pairs $(q, x) \in Q \times K$ such that $q$ can be reached from an initial state
of the original machine by a path $\sigma$ with $i[\sigma] = i[\pi]$ and $\lambda[p[\sigma]] \otimes w[\sigma] =
\lambda[p[\pi]] \otimes w[\pi] \otimes x$. Thus, $x$ can be viewed as the residual weight at state $q$.
The algorithm takes as input a weighted automaton $\mathcal{A} =
(A, Q, I, F, E, \lambda, \rho)$ and, when it terminates, yields an equivalent subsequential
weighted automaton $\mathcal{A}' = (A, Q', I', F', E', \lambda', \rho')$.

The algorithm uses a queue $S$ containing the set of states of the resulting
automaton $\mathcal{A}'$, yet to be examined. The queue discipline of $S$ can be arbitrarily
chosen and does not affect the termination of the algorithm. $\mathcal{A}'$ admits a unique
initial state, $i'$, defined as the set of initial states of $\mathcal{A}$ augmented with their
respective initial weights. Its initial weight is $\bar{1}$ (lines 1-2). $S$ originally contains
only the subset $i'$ (line 3). Each time through the loop of lines 4-16, a new
subset $p'$ is extracted from $S$ (lines 5-6). For each $x$ labeling at least one of
the transitions leaving a state $p$ of the subset $p'$, a new transition with input
label $x$ is constructed. The weight $w'$ associated to that transition is the sum of
the weights of all transitions in $E[Q[p']]$ labeled with $x$ pre-$\otimes$-multiplied by the
residual weight $v$ at each state $p$ (line 8). The destination state of the transition
is the subset containing all the states $q$ reached by transitions in $E[Q[p']]$ labeled
with $x$. The weight of each state $q$ of the subset is obtained by taking the $\oplus$-sum
of the residual weights of the states $p$ $\otimes$-times the weight of the transition from
$p$ leading to $q$ and by dividing that by $w'$ (line 9). The new subset $q'$ is inserted in
the queue $S$ when it is a new state (line 16). If any of the states in the subset $q'$
is final, $q'$ is made a final state and its final weight is obtained by summing
the final weights of all the final states in $q'$, pre-$\otimes$-multiplied by their residual
weight $v$ (lines 13-15).
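A minimal Python sketch of this construction, specialized to the tropical semiring ($\oplus = \min$, $\otimes = +$, division is subtraction), is given below. It is an illustration under assumptions not found in the original text: the data layout, the function name and the `max_states` guard are hypothetical, final weights are computed when a subset is dequeued (so that the initial subset is also covered), and exact (for instance integer) weights are assumed so that equal weighted subsets compare equal.

```python
from collections import deque

def determinize_tropical(initial, final, transitions, max_states=1000):
    """Weighted determinization over the tropical semiring.

    initial, final: dict state -> weight; transitions: (source, label, weight, destination).
    Returns (start subset, final weights, transitions) of the subsequential automaton.
    """
    leaving = {}
    for p, x, w, q in transitions:
        leaving.setdefault(p, []).append((x, w, q))

    start = frozenset(initial.items())        # i' = {(i, lambda(i))}; its initial weight is 0
    seen, queue = {start}, deque([start])
    new_final, new_transitions = {}, []
    while queue:
        subset = queue.popleft()
        finals = [v + final[q] for q, v in subset if q in final]
        if finals:
            new_final[subset] = min(finals)
        labels = sorted({x for p, _ in subset for x, _, _ in leaving.get(p, [])})
        for x in labels:
            best = {}                          # best[q] = min over matching (p, v), w of v + w
            for p, v in subset:
                for lab, w, q in leaving.get(p, []):
                    if lab == x:
                        best[q] = min(best.get(q, float("inf")), v + w)
            w_new = min(best.values())         # weight of the new transition
            dest = frozenset((q, d - w_new) for q, d in best.items())  # residual weights
            new_transitions.append((subset, x, w_new, dest))
            if dest not in seen:
                if len(seen) >= max_states:
                    raise RuntimeError("too many subsets; input may not be determinizable")
                seen.add(dest)
                queue.append(dest)
    return start, new_final, new_transitions
```

The guard matters because the construction need not halt: if two states reached by the same string carry cycles with the same label but different weights, the residual weights keep growing and new subsets are created indefinitely.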

Figures 4.4(a)-(b) illustrate the determinization of a weighted automaton 
over the tropical semiring. The worst case complexity of determinization is 
exponential even in the unweighted case. However, in many practical cases 
such as for weighted automata used in large-vocabulary speech recognition, this 
blow-up does not occur. It is also important to notice that just like composition, 
determinization admits a natural lazy implementation which can be useful for 
saving space. 

Unlike the unweighted case, determinization does not halt on all input
weighted automata. In fact, some weighted automata, non-subsequentiable
automata, do not even admit equivalent subsequential machines. But even for
some subsequentiable automata, the algorithm does not halt. We say that a
weighted automaton $\mathcal{A}$ is determinizable if the determinization algorithm halts
for the input $\mathcal{A}$. With a determinizable input, the algorithm outputs an
equivalent subsequential weighted automaton.

Figure 4.4. Determinization of weighted automata. (a) Weighted automaton
over the tropical semiring $\mathcal{A}$. (b) Equivalent weighted automaton
$\mathcal{B}$ obtained by determinization of $\mathcal{A}$. (c) Non-determinizable weighted
automaton over the tropical semiring, states 1 and 2 are non-twin siblings.

There exists a general twins property for weighted automata that provides a
characterization of determinizable weighted automata under some general
conditions. Let $\mathcal{A}$ be a weighted automaton over a weakly left divisible left semiring
$K$. Two states $q$ and $q'$ of $\mathcal{A}$ are said to be siblings if there exist two strings $x$
and $y$ in $A^*$ such that both $q$ and $q'$ can be reached from $I$ by paths labeled
with $x$ and there is a cycle at $q$ and a cycle at $q'$ both labeled with $y$. When
$K$ is a commutative and cancellative semiring, two sibling states are said to be
twins iff for any string $y$:

    $w[P(q, y, q)] = w[P(q', y, q')]$    (4.2.5)


$\mathcal{A}$ has the twins property if any two sibling states of $\mathcal{A}$ are twins. Figure 4.4(c)
shows an unambiguous weighted automaton over the tropical semiring that does
not have the twins property: states 1 and 2 can be reached by paths labeled
with $a$ from the initial state and admit cycles with the same label $b$, but the
weights of these cycles (3 and 4) are different.


THEOREM 4.2.2. Let $\mathcal{A}$ be a weighted automaton over the tropical semiring.
If $\mathcal{A}$ has the twins property, then $\mathcal{A}$ is determinizable.


With trim unambiguous weighted automata, the condition is also necessary.


THEOREM 4.2.3. Let $\mathcal{A}$ be a trim unambiguous weighted automaton over the
tropical semiring. Then the three following properties are equivalent:

1. $\mathcal{A}$ is determinizable.

2. $\mathcal{A}$ has the twins property.

3. $\mathcal{A}$ is subsequentiable.


There exists an efficient algorithm for testing the twins property for weighted 
automata, which cannot be presented briefly in this chapter. Note that any 
acyclic weighted automaton over a zero-sum-free semiring has the twins property 
and is determinizable. 
4.2.3. Weight pushing


The choice of the distribution of the total weight along each successful path of 
a weighted automaton does not affect the definition of the function realized by 
that automaton, but this may have a critical impact on the efficiency in many 
applications, e.g., natural language processing applications, when a heuristic 
pruning is used to visit only a subpart of the automaton. There exists an 
algorithm, weight pushing, for normalizing the distribution of the weights along 
the paths of a weighted automaton or more generally a weighted directed graph. 
The transducer normalization algorithm presented in Chapter 1 Section 1.5 can 
be viewed as a special instance of this algorithm. 

Let $\mathcal{A}$ be a weighted automaton over a semiring $K$. Assume that $K$ is
zero-sum-free and weakly left divisible. For any state $q \in Q$, assume that the
following sum is well-defined and in $K$:

    $d[q] = \bigoplus_{\pi \in P(q, F)} (w[\pi] \otimes \rho[n[\pi]])$    (4.2.6)

$d[q]$ is the shortest-distance from $q$ to $F$. $d[q]$ is well-defined for all $q \in Q$ when $K$
is a k-closed semiring. The weight pushing algorithm consists of computing each
shortest-distance $d[q]$ and of reweighting the transition weights, initial weights
and final weights in the following way:

    $\forall e \in E$ s.t. $d[p[e]] \neq \bar{0}$,  $w[e] \leftarrow d[p[e]]^{-1} \otimes w[e] \otimes d[n[e]]$    (4.2.7)
    $\forall q \in I$,  $\lambda[q] \leftarrow \lambda[q] \otimes d[q]$    (4.2.8)
    $\forall q \in F$ s.t. $d[q] \neq \bar{0}$,  $\rho[q] \leftarrow d[q]^{-1} \otimes \rho[q]$    (4.2.9)


Each of these operations can be assumed to be done in constant time, thus
reweighting can be done in linear time $O(T_\otimes |\mathcal{A}|)$ where $T_\otimes$ denotes the worst
cost of an $\otimes$-operation. The complexity of the computation of the
shortest-distances depends on the semiring. In the case of k-closed semirings such as the
tropical semiring, $d[q]$, $q \in Q$, can be computed using a generic shortest-distance
algorithm. The complexity of the algorithm is linear in the case of an acyclic
automaton: $O(\mathrm{Card}(Q) + (T_\oplus + T_\otimes)\,\mathrm{Card}(E))$, where $T_\oplus$ denotes the worst cost
of an $\oplus$-operation. In the case of a general weighted automaton over the tropical
semiring, the complexity of the algorithm is $O(\mathrm{Card}(E) + \mathrm{Card}(Q) \log \mathrm{Card}(Q))$.

In the case of closed semirings such as $(\mathbb{R}_+, +, \times, 0, 1)$, a generalization of
the Floyd-Warshall algorithm for computing all-pairs shortest-distances can be
used. The complexity of the algorithm is $\Theta(\mathrm{Card}(Q)^3 (T_\oplus + T_\otimes + T_*))$ where $T_*$
denotes the worst cost of the closure operation. The space complexity of these
algorithms is $O(\mathrm{Card}(Q)^2)$. These complexities make it impractical to use the
Floyd-Warshall algorithm for computing $d[q]$, $q \in Q$, for relatively large graphs
or automata of several hundred million states or transitions. An approximate
version of a generic shortest-distance algorithm can be used instead to compute
$d[q]$ efficiently.

Roughly speaking, the algorithm pushes the weights of each path as much as 
possible towards the initial states. Figures 4.5(a)-(c) illustrate the application 
of the algorithm in a special case both for the tropical and probability semirings. 
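Weight pushing in the tropical semiring can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and data layout are assumptions, the shortest distances are computed by a simple Bellman-Ford style relaxation rather than the generic shortest-distance algorithm referred to above, and the automaton is assumed to have no negative-weight cycle.

```python
import math

def push_weights_tropical(states, initial, final, transitions):
    """Weight pushing over the tropical semiring (min, +).

    initial, final: dict state -> weight; transitions: (source, label, weight, destination).
    """
    # d[q] = min over paths pi from q to F of w[pi] + rho[n[pi]]
    d = {q: final.get(q, math.inf) for q in states}
    for _ in range(len(states)):               # relax until stable
        changed = False
        for p, x, w, q in transitions:
            if w + d[q] < d[p]:
                d[p] = w + d[q]
                changed = True
        if not changed:
            break
    # Reweighting: division by d[p] is subtraction in the tropical semiring.
    new_transitions = [(p, x, w + d[q] - d[p], q)
                       for p, x, w, q in transitions if d[p] != math.inf]
    new_initial = {q: v + d[q] for q, v in initial.items()}
    new_final = {q: v - d[q] for q, v in final.items() if d[q] != math.inf}
    return new_initial, new_final, new_transitions
```

On a small automaton with paths 0 -a/1-> 1 -c/5-> 3 and 0 -b/2-> 2 -c/3-> 3 (state 3 final with weight 0), the shortest distance from state 0 is 5; after pushing, that amount sits on the initial weight and the minimum weight of the transitions leaving each non-final state is 0.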
Figure 4.5. Weight pushing algorithm. (a) Weighted automaton $\mathcal{A}$.
(b) Equivalent weighted automaton $\mathcal{B}$ obtained by weight pushing in the
tropical semiring. (c) Weighted automaton $\mathcal{C}$ obtained from $\mathcal{A}$ by weight
pushing in the probability semiring. (d) Minimal weighted automaton
over the tropical semiring equivalent to $\mathcal{A}$.


Note that if $d[q] = \bar{0}$, then, since $K$ is zero-sum-free, the weight of all paths
from $q$ to $F$ is $\bar{0}$. Let $\mathcal{A}$ be a weighted automaton over the semiring $K$. Assume
that $K$ is closed or k-closed and that the shortest-distances $d[q]$ are all
well-defined and in $K - \{\bar{0}\}$. Note that in both cases we can use the distributivity over
the infinite sums defining shortest distances. Let $e'$ ($\pi'$) denote the transition $e$
(path $\pi$) after application of the weight pushing algorithm. $e'$ ($\pi'$) differs from
$e$ (resp. $\pi$) only by its weight. Let $\lambda'$ denote the new initial weight function,
and $\rho'$ the new final weight function.


PROPOSITION 4.2.4. Let $\mathcal{B} = (A, Q, I, F, E', \lambda', \rho')$ be the result of the weight
pushing algorithm applied to the weighted automaton $\mathcal{A}$, then

1. the weight of a successful path $\pi$ is unchanged after application of weight
pushing:

    $\lambda'[p[\pi']] \otimes w[\pi'] \otimes \rho'[n[\pi']] = \lambda[p[\pi]] \otimes w[\pi] \otimes \rho[n[\pi]]$    (4.2.10)

2. the weighted automaton $\mathcal{B}$ is stochastic, i.e.

    $\forall q \in Q$,  $\bigoplus_{e' \in E'[q]} w[e'] = \bar{1}$    (4.2.11)

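Proof. Let $\pi' = e_1' \cdots e_k'$. By definition of $\lambda'$ and $\rho'$,

    $\lambda'[p[\pi']] \otimes w[\pi'] \otimes \rho'[n[\pi']]$
        $= \lambda[p[e_1]] \otimes d[p[e_1]] \otimes d[p[e_1]]^{-1} \otimes w[e_1] \otimes d[n[e_1]] \otimes \cdots$
          $\otimes\, d[p[e_k]]^{-1} \otimes w[e_k] \otimes d[n[e_k]] \otimes d[n[e_k]]^{-1} \otimes \rho[n[\pi]]$
        $= \lambda[p[\pi]] \otimes w[e_1] \otimes \cdots \otimes w[e_k] \otimes \rho[n[\pi]]$

which proves the first statement of the proposition. Let $q \in Q$,

    $\bigoplus_{e' \in E'[q]} w[e'] = \bigoplus_{e \in E[q]} d[q]^{-1} \otimes w[e] \otimes d[n[e]]$
        $= d[q]^{-1} \otimes \bigoplus_{e \in E[q]} w[e] \otimes d[n[e]]$
        $= d[q]^{-1} \otimes \bigoplus_{e \in E[q]} w[e] \otimes \bigoplus_{\pi \in P(n[e], F)} (w[\pi] \otimes \rho[n[\pi]])$
        $= d[q]^{-1} \otimes \bigoplus_{e \in E[q],\, \pi \in P(n[e], F)} (w[e] \otimes w[\pi] \otimes \rho[n[\pi]])$
        $= d[q]^{-1} \otimes d[q] = \bar{1}$

where we used the distributivity of the multiplicative operation over infinite
sums in closed or k-closed semirings. This proves the second statement of the
proposition. $\square$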


These two properties of weight pushing are illustrated by Figures 4.5(a)-(c): the
total weight of a successful path is unchanged after pushing; at each state of
the weighted automaton of Figure 4.5(b), the minimum weight of the outgoing
transitions is 0, and at each state of the weighted automaton of Figure 4.5(c),
the weights of outgoing transitions sum to 1. Weight pushing can also be used
to test the equivalence of two weighted automata.


4.2.4. Minimization 


A deterministic weighted automaton is said to be minimal if there exists no other
deterministic weighted automaton with a smaller number of states and realizing 
the same function. Two states of a deterministic weighted automaton are said to 
be equivalent if exactly the same set of strings with the same weights label paths 
from these states to a final state, the final weights being included. Thus, two 
equivalent states of a deterministic weighted automaton can be merged without 
affecting the function realized by that automaton. A weighted automaton is 
minimal when it admits no two distinct equivalent states after any redistribution 
of the weights along its paths. 

There exists a general algorithm for computing a minimal deterministic au- 
tomaton equivalent to a given weighted automaton. It is thus a generalization 
of the minimization algorithms for unweighted finite automata. It can be com- 
bined with the minimization algorithm for unweighted finite-state transducers 
presented in Chapter 1 Section 1.5 to minimize weighted finite-state transducers.³
It consists of first applying the weight pushing algorithm to normalize the
distribution of the weights along the paths of the input automaton, and then 
of treating each pair (label, weight) as a single label and applying the classical 
(unweighted) automata minimization. 
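The second step, classical minimization with each pair (label, weight) treated as a single label, can be sketched as follows. This Python example is an assumption-laden illustration: it uses a simple Moore-style partition refinement rather than an asymptotically optimal algorithm, it expects a deterministic automaton whose weights have already been pushed, and the function name and data layout are hypothetical.

```python
def minimize_pushed(states, final, transitions):
    """Partition the states of a pushed deterministic weighted automaton.

    states: list of states; final: dict state -> final weight;
    transitions: (source, label, weight, destination).
    Returns a dict state -> block number; states in the same block can be merged.
    """
    out = {q: [] for q in states}
    for p, x, w, q in transitions:
        out[p].append(((x, w), q))             # (label, weight) acts as one symbol

    def number(key):
        ids = {}
        return {q: ids.setdefault(key(q), len(ids)) for q in states}

    # Initial partition: states grouped by final weight (None when not final).
    block = number(lambda q: final.get(q))
    while True:
        refined = number(
            lambda q: (block[q], tuple(sorted((sym, block[dst]) for sym, dst in out[q]))))
        if len(set(refined.values())) == len(set(block.values())):
            return refined
        block = refined
```

Applied to an automaton with transitions 0 -a/1-> 1, 0 -b/2-> 2, 1 -c/5-> 3, 2 -c/3-> 3 over the tropical semiring, no two states fall in the same block. After pushing (the two c-transitions both get weight 0), states 1 and 2 end up in the same block and can be merged.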


THEOREM 4.2.5. Let $\mathcal{A}$ be a deterministic weighted automaton over a semiring
$K$. Assume that the conditions of application of the weight pushing algorithm
hold, then the execution of the following steps:

1. weight pushing,

2. (unweighted) automata minimization,

lead to a minimal weighted automaton equivalent to $\mathcal{A}$.


³ In reality, the minimization of both unweighted and weighted finite-state transducers can
be viewed as special instances of the algorithm presented here, but, for clarity purposes, we
will not emphasize that view in what follows.

Figure 4.6. Minimization of weighted automata. (a) Weighted automaton
$\mathcal{A}'$ over the probability semiring. (b) Minimal weighted automaton
$\mathcal{B}'$ equivalent to $\mathcal{A}'$. (c) Minimal weighted automaton $\mathcal{C}'$ equivalent to $\mathcal{A}'$.


The complexity of automata minimization is linear in the case of acyclic
automata $O(\mathrm{Card}(Q) + \mathrm{Card}(E))$ and in $O(\mathrm{Card}(E) \log \mathrm{Card}(Q))$ in the general
case. Thus, in view of the complexity results given in the previous section, in
the case of the tropical semiring, the total complexity of the weighted
minimization algorithm is linear in the acyclic case $O(\mathrm{Card}(Q) + \mathrm{Card}(E))$ and in
$O(\mathrm{Card}(E) \log \mathrm{Card}(Q))$ in the general case.

Figures 4.5(a), 4.5(b), and 4.5(d) illustrate the application of the algorithm 
in the tropical semiring. The automaton of Figure 4.5(a) cannot be further 
minimized using the classical unweighted automata minimization since no two 
states are equivalent in that machine. After weight pushing, the automaton 
(Figure 4.5(b)) has two states (1 and 2) that can be merged by the classical 
unweighted automata minimization. 

Figures 4.6(a)-(c) illustrate the minimization of an automaton defined over
the probability semiring. Unlike the unweighted case, a minimal weighted
automaton is not unique, but all minimal weighted automata have the same graph
topology; they differ only in the way the weights are distributed along each
path. The weighted automata $\mathcal{B}'$ and $\mathcal{C}'$ are both minimal and equivalent to
$\mathcal{A}'$. $\mathcal{B}'$ is obtained from $\mathcal{A}'$ using the algorithm described above in the
probability semiring and it is thus a stochastic weighted automaton in the probability
semiring.

For a deterministic weighted automaton, the first operation of the semir- 
ing can be arbitrarily chosen without affecting the definition of the function 
it realizes. This is because, by definition, a deterministic weighted automaton 
admits at most one path labeled with any given string. Thus, in the algorithm 
described in Theorem 4.2.5, the weight pushing step can be executed in any
semiring $K'$ whose multiplicative operation matches that of $K$. The minimal
weighted automaton obtained by pushing the weights in $K'$ is also minimal in $K$
since it can be interpreted as a (deterministic) weighted automaton over $K$.

In particular, $\mathcal{A}'$ can be interpreted as a weighted automaton over the
semiring $(\mathbb{R}_+, \max, \times, 0, 1)$. The application of the weighted minimization algorithm
to $\mathcal{A}'$ in this semiring leads to the minimal weighted automaton $\mathcal{C}'$ of
Figure 4.6(c). $\mathcal{C}'$ is also a stochastic weighted automaton in the sense that, at any
state, the maximum weight of all outgoing transitions is one.

This fact leads to several interesting observations. One is related to the 
complexity of the algorithms. Indeed, we can choose a semiring K’ in which 
the complexity of weight pushing is better than in K. The resulting automaton 
is still minimal in K and has the additional property of being stochastic in K’. 
It only differs from the weighted automaton obtained by pushing weights in 
K in the way weights are distributed along the paths. They can be obtained 
from each other by application of weight pushing in the appropriate semiring. 
In the particular case of a weighted automaton over the probability semiring, 
it may be preferable to use weight pushing in the $(\max, \times)$-semiring since the
complexity of the algorithm is then equivalent to that of classical single-source 
shortest-paths algorithms. The corresponding algorithm is a special instance of 
the generic shortest-distance algorithm. 

Another important point is that the weight pushing algorithm may not be 
defined in K because the machine is not zero-sum-free or for other reasons. 
But an alternative semiring K’ can sometimes be used to minimize the input 
weighted automaton. 

The results just presented were all related to the minimization of the num- 
ber of states of a deterministic weighted automaton. The following proposition 
shows that minimizing the number of states coincides with minimizing the num- 
ber of transitions. 


PROPOSITION 4.2.6. Let $\mathfrak{A}$ be a minimal deterministic weighted automaton.
Then $\mathfrak{A}$ has the minimal number of transitions.


Proof. Let $\mathfrak{A}$ be a deterministic weighted automaton with the minimal number
of transitions. If two distinct states of $\mathfrak{A}$ were equivalent, they could be merged,
thereby strictly reducing the number of its transitions. Thus, $\mathfrak{A}$ must be a
minimal deterministic automaton. Since minimal deterministic automata have
the same topology, in particular the same number of states and transitions, this
proves the proposition.


4.3. Application to speech recognition 


Many of the statistical techniques now widely used in natural language processing
were inspired by early work in speech recognition. This section discusses
the representation of the component models of an automatic speech recogni- 
tion system by weighted transducers and describes how they can be combined, 
searched, and optimized using the algorithms described in the previous sec- 
tions. The methods described can be used similarly in many other areas of 
natural language processing.


4.3.1. Statistical formulation


Speech recognition consists of generating accurate written transcriptions for spo- 
ken utterances. The desired transcription is typically a sequence of words, but it 
may also be the utterance’s phonemic or syllabic transcription or a transcription 
into any other sequence of written units. 

The problem can be formulated as a maximum-likelihood decoding problem,
or the so-called noisy channel problem. Given a speech utterance, speech
recognition consists of determining its most likely written transcription. Thus, if we
let o denote the observation sequence produced by a signal processing system, w
a (word) transcription sequence over an alphabet A, and P(w | o) the probability
of the transduction of o into w, the problem consists of finding ŵ as defined
by:

$$\hat{w} = \operatorname*{argmax}_{w \in A^*} P(w \mid o) \tag{4.3.1}$$

Using Bayes' rule, P(w | o) can be rewritten as P(o | w) P(w) / P(o). Since P(o) does
not depend on w, the problem can be reformulated as:

$$\hat{w} = \operatorname*{argmax}_{w \in A^*} P(o \mid w)\, P(w) \tag{4.3.2}$$


where P(w) is the a priori probability of the written sequence w in the language 
considered and P(o | w) the probability of observing o given that the sequence 
w has been uttered. The probabilistic model used to estimate P(w) is called 
a language model or a statistical grammar. The generative model associated 
to P(o | w) is a combination of several knowledge sources, in particular the 
acoustic model, and the pronunciation model. P(o | w) can be decomposed into 
several intermediate levels, e.g., that of phones, syllables, or other units. In most
large-vocabulary speech recognition systems, it is decomposed into the following 
probabilistic models that are assumed independent: 


• P(p | w), a pronunciation model or lexicon transducing word sequences w
  to phonemic sequences p;


• P(c | p), a context-dependency transduction mapping phonemic sequences
  p to context-dependent phone sequences c;


• P(d | c), a context-dependent phone model mapping sequences of
  context-dependent phones c to sequences of distributions d; and


• P(o | d), an acoustic model applying distribution sequences d to
  observation sequences.⁴


Since the models are assumed to be independent,


$$P(o \mid w) = \sum_{d,c,p} P(o \mid d)\, P(d \mid c)\, P(c \mid p)\, P(p \mid w) \tag{4.3.3}$$


4 P(o | d)P(d | c) or P(o | d)P(d | c)P(c | p) is often called an acoustic model.

Equation 4.3.2 can thus be rewritten as:


$$\hat{w} = \operatorname*{argmax}_{w} \sum_{d,c,p} P(o \mid d)\, P(d \mid c)\, P(c \mid p)\, P(p \mid w)\, P(w) \tag{4.3.4}$$


The following sections discuss the definition and representation of each of these 
models and that of the observation sequences in more detail. The transduction 
models are typically given either directly or as a result of an approximation as 
weighted finite-state transducers. Similarly, the language model is represented 
by a weighted automaton. 


4.3.2. Statistical grammar 


In some relatively restricted tasks, the language model for P(w) is based on 
an unweighted rule-based grammar. But, in most large-vocabulary tasks, the 
model is a weighted grammar derived from large corpora of several million words 
using statistical methods. The purpose of the model is to assign a probability 
to each sequence of words, thereby assigning a ranking to all sequences. Thus, 
the parsing information it may supply is not directly relevant to the statistical 
formulation described in the previous section. 

The probabilistic model derived from corpora may be a probabilistic context-free
grammar. But, in general, context-free grammars are computationally
too demanding for real-time speech recognition systems. The amount of work
required to expand a recognition hypothesis can be unbounded for an
unrestricted grammar. Instead, a regular approximation of a probabilistic
context-free grammar is used. In most large-vocabulary speech recognition systems, the
probabilistic model is in fact directly constructed as a weighted regular grammar
and represents an n-gram model. Thus, this section concentrates on a brief
description of these models.⁵

Regardless of the structure of the model, using Bayes' rule, the probability
of the word sequence $w = w_1 \cdots w_k$ can be written as the following product
of conditional probabilities:


$$P(w) = \prod_{i=1}^{k} P(w_i \mid w_1 \cdots w_{i-1}) \tag{4.3.5}$$


An n-gram model is based on the Markovian assumption that the probability
of the occurrence of a word only depends on the n − 1 preceding words, that is,
for $i = 1, \ldots, k$:


$$P(w_i \mid w_1 \cdots w_{i-1}) = P(w_i \mid h_i) \tag{4.3.6}$$


where the conditioning history $h_i$ has length at most n − 1: $|h_i| \le n - 1$. Thus,


$$P(w) = \prod_{i=1}^{k} P(w_i \mid h_i) \tag{4.3.7}$$


5 Similar probabilistic models are designed for biological sequences (see Chapter 6).


Figure 4.7. Katz back-off n-gram model. (a) Representation of a trigram
model with failure transitions labeled with Φ. (b) Bigram model derived
from the input text hello bye bye. The automaton is defined over the log
semiring (the transition weights are negative log-probabilities). State 0 is
the initial state. State 1 corresponds to the word bye and state 3 to the
word hello. State 2 is the back-off state.


Let c(w) denote the number of occurrences of a sequence w in the corpus. $c(h_i)$
and $c(h_i w_i)$ can be used to estimate the conditional probability $P(w_i \mid h_i)$.
When $c(h_i) \neq 0$, the maximum likelihood estimate of $P(w_i \mid h_i)$ is:


$$\hat{P}(w_i \mid h_i) = \frac{c(h_i w_i)}{c(h_i)} \tag{4.3.8}$$


But a classical data sparsity problem arises in the design of all n-gram models:
the corpus, no matter how large, may contain no occurrence of $h_i$ ($c(h_i) = 0$).
A solution to this problem is based on smoothing techniques. This consists of
adjusting $\hat{P}$ to reserve some probability mass for unseen n-gram sequences.
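
The maximum likelihood estimate and the sparsity problem can be illustrated on the text hello bye bye used in Figure 4.7(b). The sketch below is illustrative only: the function name `mle_ngram` and the sentence-boundary markers `<s>` and `</s>` are assumptions, and histories are counted only where they are followed by a word so that the estimates sum to one for each history.

```python
from collections import Counter

def mle_ngram(tokens, n):
    """Maximum likelihood n-gram estimates c(h w) / c(h) from a token list.

    Returns {(h_1, ..., h_{n-1}, w): probability}. Histories are counted
    only where they are followed by a word (an assumption of this sketch).
    """
    positions = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in positions)
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in positions)
    return {g: c / histories[g[:-1]] for g, c in ngrams.items()}

# Boundary markers <s> and </s> are hypothetical additions to the corpus.
CORPUS = ["<s>", "hello", "bye", "bye", "</s>"]
BIGRAMS = mle_ngram(CORPUS, 2)
```

Here the bigram (bye, hello) never occurs in the corpus, so it receives no probability mass at all; smoothing is meant to repair exactly this.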

Figure 4.8. Section of a pronunciation model of English, a weighted
transducer over the probability semiring giving a compact representation
of four pronunciations of the word data due to two distinct pronunciations
of the first vowel a and two pronunciations of the consonant t (flapped or
not).


Let $\tilde{P}(w_i \mid h_i)$ denote the adjusted conditional probability. A smoothing
technique widely used in language modeling is the Katz back-off technique.
The idea is to "back off" to lower-order n-gram sequences when $c(h_i w_i) = 0$.
Define the back-off sequence of $h_i$ as the lower-order n-gram sequence suffix of
$h_i$ and denote it by $h'_i$: $h_i = u h'_i$ for some word u. Then, in a Katz back-off
model, $P(w_i \mid h_i)$ is defined as follows:


$$P(w_i \mid h_i) = \begin{cases} \tilde{P}(w_i \mid h_i) & \text{if } c(h_i w_i) > 0 \\ \alpha_{h_i} P(w_i \mid h'_i) & \text{otherwise} \end{cases} \tag{4.3.9}$$


where $\alpha_{h_i}$ is a factor ensuring normalization. The Katz back-off model admits a
natural representation by a weighted automaton in which each state encodes a
conditioning history of length less than n. As in the classical de Bruijn graphs,
there is a transition labeled with $w_i$ from the state encoding $h_i$ to the state
encoding $h_i w_i$ when $c(h_i w_i) \neq 0$. A so-called failure transition can be used to
capture the semantics of "otherwise" in the definition of the Katz back-off model
and keep its representation compact. A failure transition is a transition taken at
state q when no other transition leaving q has the desired label. Figure 4.7(a)
illustrates that construction in the case of a trigram model (n = 3).

It is possible to give an explicit representation of these weighted automata
without using failure transitions. However, the size of the resulting automata
may become prohibitive. Instead, an approximation of that weighted automaton
is used where failure transitions are simply replaced by ε-transitions. This turns
out to cause only a very limited loss in accuracy.⁶
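
The "otherwise" semantics of a failure transition can be sketched as follows for a toy bigram back-off automaton over negative log weights. All names (`ARCS`, `FAILURE`, `step`, `sequence_cost`) and all numeric weights are made up for illustration; they are not the weights of Figure 4.7.

```python
import math

# Hypothetical toy model: states are histories, weights are negative logs.
ARCS = {
    ("hello",): {"bye": (0.2, ("bye",))},
    ("bye",): {"bye": (0.9, ("bye",))},
    (): {"hello": (1.1, ("hello",)), "bye": (0.7, ("bye",))},  # back-off state
}
# Failure transitions: state -> (back-off weight, back-off state).
FAILURE = {("hello",): (1.6, ()), ("bye",): (0.5, ())}

def step(state, word):
    """Read `word` at `state`; return (cost, next_state).

    A failure transition is followed only when no transition leaving the
    current state carries the desired label.
    """
    cost = 0.0
    while word not in ARCS[state]:
        if state not in FAILURE:
            return math.inf, state  # word unknown even at the back-off state
        backoff_cost, state = FAILURE[state]
        cost += backoff_cost
    arc_cost, next_state = ARCS[state][word]
    return cost + arc_cost, next_state

def sequence_cost(words, start=()):
    """Total negative log weight of a word sequence."""
    total, state = 0.0, start
    for word in words:
        cost, state = step(state, word)
        total += cost
    return total
```

Replacing the failure transition by an ε-transition would additionally allow backing off even when a matching transition exists, which is the source of the approximation mentioned above.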

In practice, for reasons of numerical stability, negative log-probabilities are
used and the language model weighted automaton is defined in the log semiring.
Figure 4.7(b) shows the corresponding weighted automaton in a very simple
case. We will denote by $\mathfrak{G}$ the weighted automaton representing the statistical
grammar.


4.3.3. Pronunciation model 


The representation of a pronunciation model P(p | w) (or lexicon) with weighted
transducers is quite natural. Each word has a finite number of phonemic
transcriptions. The probability of each pronunciation can be estimated from a
corpus. Thus, for each word x, a simple weighted transducer $\mathfrak{P}_x$ mapping x to its
phonemic transcriptions can be constructed.

Figure 4.8 shows that representation in the case of the English word data.
The closure of the union of the transducers $\mathfrak{P}_x$ for all the words x considered
gives a weighted transducer representation of the pronunciation model. We will
denote by $\mathfrak{P}$ the equivalent transducer over the log semiring.
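
The compact representation of Figure 4.8 can be unfolded into its four pronunciations with a short sketch. The weights 0.8, 0.2, 0.6 and 0.4 are those visible in the figure; the surrounding phones d and ax with weight 1.0, and the names `DATA` and `pronunciations`, are assumptions made for this example.

```python
from itertools import product

# One list of (phone, weight) alternatives per position in the word "data".
# "d" and "ax" with weight 1.0 are assumed; the other weights follow Figure 4.8.
DATA = [
    [("d", 1.0)],
    [("ey", 0.8), ("ae", 0.2)],
    [("dx", 0.6), ("t", 0.4)],
    [("ax", 1.0)],
]

def pronunciations(slots):
    """Enumerate {phone sequence: probability}; weights multiply along a path
    (probability semiring)."""
    result = {}
    for path in product(*slots):
        phones = tuple(phone for phone, _ in path)
        prob = 1.0
        for _, weight in path:
            prob *= weight
        result[phones] = prob
    return result

PRONS = pronunciations(DATA)
```

The four resulting probabilities sum to one, and the transducer stores them with six transitions rather than four separate paths.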


6 An alternative when no offline optimization is used is to compute the explicit
representation on-the-fly, as needed for the recognition of an utterance. There also
exists a complex method for constructing an exact representation of an n-gram model,
which cannot be presented in this short chapter.


Figure 4.9. Context-dependency transducer restricted to two phones p and q.


4.3.4. Context-dependency transduction 


The pronunciation of a phone depends on its neighboring phones. To design
an accurate acoustic model, it is thus beneficial to model a context-dependent
phone, i.e., a phone in the context of its surrounding phones. This has also
been corroborated by empirical evidence. The standard models used in speech
recognition are n-phonic models. A context-dependent phone is then a phone in
the context of its $n_1$ previous phones and $n_2$ following phones, with $n_1 + n_2 + 1 = n$.
Remarkably, the mapping P(c | p) from phone sequences to sequences of
context-dependent phones can be represented by finite-state transducers. This
section illustrates that construction in the case of triphonic models ($n_1 = n_2 = 1$).
The extension to the general case is straightforward.

Let P denote the set of context-independent phones and let C denote the
set of triphonic context-dependent phones. For a language such as English or
French, Card(P) ≈ 50. Let ${}_{p_1}p_{p_2}$ denote the context-dependent phone
corresponding to the phone p with the left context $p_1$ and the right context $p_2$.

Figure 4.10. Hidden-Markov Model transducer.


The construction of the context-dependency transducer is similar to that of
the language model automaton. As in the previous case, for reasons of numerical
stability, negative log-probabilities are used, thus the transducer is defined in the
log semiring. Each state encodes a history limited to the last two phones. There
is a transition from the state associated to (p, q) to (q, r) with input label the
context-dependent phone ${}_p q_r$ and output label q. More precisely, the transducer
$\mathfrak{C} = (C, P, Q, I, F, E, \lambda, \rho)$ is defined by:


• $Q = \{(p, q) : p \in P,\ q \in P \cup \{\varepsilon\}\} \cup \{(\varepsilon, C)\}$;

• $I = \{(\varepsilon, C)\}$ and $F = \{(p, \varepsilon) : p \in P\}$;

• $E \subseteq \{((p, Y),\ {}_p q_r,\ q,\ 0,\ (q, r)) : Y = q \text{ or } Y = C\}$;


with all initial and final weights equal to zero. Figure 4.9 shows that transducer
in the simple case where the phonemic alphabet is reduced to two phones (P =
{p, q}). We will denote by $\mathfrak{C}$ the weighted transducer representing the
context-dependency mapping.
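
The relation realized by the context-dependency transducer in the triphonic case can be sketched directly on sequences, without building the transducer. The function names and the boundary symbol `"eps"` used for a missing context are assumptions made for this example.

```python
def to_triphones(phones, boundary="eps"):
    """Map a phone sequence to (left context, phone, right context) triples.

    `boundary` stands in for the missing context at either end of the
    sequence; its spelling is an assumption of this sketch.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

def to_phones(triphones):
    """Project each context-dependent phone back to its central phone."""
    return [phone for _left, phone, _right in triphones]
```

For instance, the sequence d ey t ax is mapped to four triples, the second of which is (d, ey, t); projecting the triples back recovers the original phone sequence.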


4.3.5. Acoustic model 


In most modern speech recognition systems, context-dependent phones are modeled
by three-state Hidden Markov Models (HMMs). Figure 4.10 shows the
graphical representation of that model for a context-dependent phone ${}_p q_r$. The
context-dependent phone is modeled by three states (0, 1, and 2), each modeled
with a distinct distribution ($d_0$, $d_1$, $d_2$) over the input observations. The
mapping P(d | c) from sequences of context-dependent phones to sequences of
distributions is the transducer obtained by taking the closure of the union of
the finite-state transducers associated to all context-dependent phones. We will
denote by $\mathfrak{H}$ that transducer. Each distribution $d_i$ is typically a mixture of
Gaussian distributions with mean $\mu$ and covariance matrix $\sigma$:


$$P(\omega) = \frac{1}{(2\pi)^{N/2} |\sigma|^{1/2}} \, e^{-\frac{1}{2} (\omega - \mu)^T \sigma^{-1} (\omega - \mu)} \tag{4.3.10}$$


where $\omega$ is an observation vector of dimension N. Observation vectors are
obtained by local spectral analysis of the speech waveform at regular intervals,
typically every 10 ms. In most cases, they are 39-dimensional feature vectors
(N = 39). The components are the 13 cepstral coefficients, i.e., the energy and
the first 12 components of the cepstrum and their first-order (delta cepstra) and
second-order differentials (delta-delta cepstra). The cepstrum of the (speech)
signal is the result of taking the inverse Fourier transform of the log of its
Fourier transform. Thus, if we denote by $x(\omega)$ the Fourier transform of the
signal, the first 12 coefficients $c_n$ in the following expression:


$$\log |x(\omega)| = \sum_{n=-\infty}^{\infty} c_n e^{-in\omega} \tag{4.3.11}$$


are the coefficients used in the observation vectors. This truncation of the
Fourier transform helps smooth the log magnitude spectrum. Empirically, cepstral
coefficients have been shown to be excellent features for representing the speech
signal.⁷ Thus the observation sequence $o = o_1 \cdots o_k$ can be represented by a
sequence of 39-dimensional feature vectors extracted from the signal every 10
ms. This can be represented by a simple automaton as shown in Figure 4.11,
which we will denote by $\mathfrak{O}$.


Figure 4.11. Observation sequence $O = o_1 \cdots o_k$. The time stamps $t_i$,
$i = 0, \ldots, k$, labeling states are multiples of 10 ms.

We will denote by $\mathfrak{O} \star \mathfrak{H}$ the weighted transducer resulting from the
application of the transducer $\mathfrak{H}$ to an observation sequence $\mathfrak{O}$.
$\mathfrak{O} \star \mathfrak{H}$ is the weighted
transducer mapping O to sequences of context-dependent phones, where the
weights of the transitions are the negative log of the value associated by a
distribution $d_i$ to an observation vector $o_j$, $-\log d_i(o_j)$.
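
The acoustic weight $-\log d_i(o_j)$ can be sketched as follows. To keep the example free of matrix algebra, it assumes diagonal covariance matrices, represented by a list of variances; this restriction, and the function names, are assumptions of the example rather than part of the model described above.

```python
import math

def neg_log_gaussian(x, mean, var):
    """-log of a Gaussian density with DIAGONAL covariance (an assumption).

    x, mean, var are sequences of equal length N; var holds the diagonal
    entries of the covariance matrix.
    """
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return 0.5 * (n * math.log(2 * math.pi) + log_det + quad)

def neg_log_mixture(x, components):
    """-log of a Gaussian mixture; components is a list of (weight, mean, var).

    Uses the log-sum-exp trick so that small densities do not underflow.
    """
    costs = [neg_log_gaussian(x, m, v) - math.log(w) for w, m, v in components]
    best = min(costs)
    return best - math.log(sum(math.exp(best - c) for c in costs))
```

Working with negative logs throughout means that the weights can be added directly to the transition weights of a transducer over the log or tropical semiring.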


4.3.6. Combination and search 


The previous sections described the representation of each of the components 
of a speech recognition system by a weighted transducer or weighted automa- 
ton. This section shows how these transducers and automata can be combined 
and searched efficiently using the weighted transducer algorithms previously 
described, following Equation 4.3.4. 

A so-called Viterbi approximation is often used in speech recognition. It
consists of approximating a sum of probabilities by its dominating term:


$$\hat{w} = \operatorname*{argmax}_{w} \sum_{d,c,p} P(o \mid d)\, P(d \mid c)\, P(c \mid p)\, P(p \mid w)\, P(w) \tag{4.3.12}$$

$${} \approx \operatorname*{argmax}_{w} \max_{d,c,p} P(o \mid d)\, P(d \mid c)\, P(c \mid p)\, P(p \mid w)\, P(w) \tag{4.3.13}$$


This has been shown empirically to be a relatively good approximation, though,
most likely, its introduction was originally motivated by algorithmic efficiency.
For reasons of numerical stability, negative log-probabilities are used, thus the
equation can be reformulated as:


$$\hat{w} = \operatorname*{argmin}_{w} \min_{d,c,p} \; -\log P(o \mid d) - \log P(d \mid c) - \log P(c \mid p) - \log P(p \mid w) - \log P(w)$$
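
The effect of the Viterbi approximation on negative log weights can be sketched by comparing the two ways of combining alternative paths: the sum of the log semiring and the minimum of the tropical semiring. The function names are chosen for this example only.

```python
import math

def log_plus(a, b):
    """Sum of the log semiring on negative log weights: -log(e^-a + e^-b)."""
    lo, hi = min(a, b), max(a, b)
    return lo - math.log1p(math.exp(lo - hi))

def tropical_plus(a, b):
    """Sum of the tropical semiring: keep only the dominating (smallest) term."""
    return min(a, b)

# Two alternative paths with hypothetical probabilities 0.6 and 0.3.
A, B = -math.log(0.6), -math.log(0.3)
EXACT = log_plus(A, B)        # corresponds to probability 0.6 + 0.3
VITERBI = tropical_plus(A, B)  # corresponds to probability 0.6 only
```

The tropical value is never smaller than the log-semiring value, and the two are close whenever one path dominates the others.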


7 Most often, the spectrum is first transformed using Mel frequency bands, a
non-linear scale approximating human perception.


Figure 4.12. Cascade of speech recognition transducers.


As discussed in the previous sections, these models can be represented by
weighted transducers. Using the composition algorithm for weighted transducers,
and by definition of the ⋆-operation and projection, this is equivalent to:


$$\hat{w} = \operatorname*{argmin}_{w} \Pi_2(\mathfrak{O} \star \mathfrak{H} \circ \mathfrak{C} \circ \mathfrak{P} \circ \mathfrak{G}) \tag{4.3.14}$$


Thus, speech recognition can be formulated as a cascade of compositions of
weighted transducers, illustrated by Figure 4.12. ŵ labels the path of
$\mathfrak{W} = \Pi_2(\mathfrak{O} \star \mathfrak{H} \circ \mathfrak{C} \circ \mathfrak{P} \circ \mathfrak{G})$
with the lowest weight. The problem can be viewed as
a classical single-source shortest-paths problem over the weighted automaton
$\mathfrak{W}$. Any single-source shortest-paths algorithm could be used to solve it. In
fact, since O is finite, the automaton $\mathfrak{W}$ could be acyclic, in which case the
classical linear-time single-source shortest-paths algorithm based on the topological
order could be used.

However, this scheme is not practical. This is because the size of $\mathfrak{W}$ can
be prohibitively large even for recognizing short utterances. The number of
transitions of $\mathfrak{O}$ for 10 s of speech is 1000. If the recognition transducer
$\mathfrak{T} = \mathfrak{H} \circ \mathfrak{C} \circ \mathfrak{P} \circ \mathfrak{G}$ had on the order of just 100M transitions, the size of $\mathfrak{W}$ would be
on the order of 1000 × 100M transitions, i.e., about 100 billion transitions!

In practice, instead of visiting all states and transitions, a heuristic pruning
is used. A pruning technique often used is the beam search. This consists of
exploring only states with tentative shortest-distance weights within a beam or
threshold of the weight of the best comparable state. Comparable states must
roughly correspond to the same observations, thus states of $\mathfrak{T}$ are visited in the
order of analysis of the input observation vectors, i.e., chronologically. This
is referred to as a synchronous beam search. A synchronous search restricts
the choice of the single-source shortest-paths algorithm, that is, of the order of
relaxation of the tentative shortest-distances. The specific single-source shortest-paths
algorithm then used is known as the Viterbi algorithm, which is presented in Exercise
1.3.1.

8 Note that the Viterbi approximation can be viewed simply as a change of semiring, from
the log semiring to the tropical semiring. This does not affect the topology or the weights
of the transducers but only their interpretation or use. Also, note that composition does not
make use of the first operation of the semiring, thus compositions in the log and tropical
semiring coincide.


The ⋆-operation, the Viterbi algorithm, and the beam pruning techniques
are often combined into a decoder. Here is a brief description of the decoder.
For each observation vector $o_i$ read, the transitions leaving the current states of
$\mathfrak{T}$ are expanded, and the ⋆-operation is computed on-the-fly to compute the acoustic
weights given by the application of the distributions to $o_i$. The acoustic weights
are added to the existing weight of the transitions, and out of the set of states
reached by these transitions, those with a tentative shortest-distance beyond a
pre-determined threshold are pruned out. The beam threshold can be used as a
means to select a trade-off between recognition speed and accuracy. Note that
the pruning technique used is non-admissible: the best overall path may fall
out of the beam due to local comparisons.
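
The decoder loop just described can be sketched as follows, under simplifying assumptions that are not part of the text: the recognition automaton has no ε-transitions, every transition consumes exactly one observation, final weights are zero, and the acoustic weights are supplied by a caller-provided function `acoustic_cost`. All names are hypothetical.

```python
import math

def beam_search(arcs, start, finals, observations, acoustic_cost, beam):
    """Synchronous Viterbi beam search over a toy epsilon-free automaton.

    arcs: {state: [(distribution, output_label_or_None, weight, next_state)]}
    Returns (cost, output_labels), or (inf, []) if no final state survives.
    """
    active = {start: (0.0, [])}  # state -> (tentative cost, outputs so far)
    for obs in observations:
        expanded = {}
        for state, (cost, outputs) in active.items():
            for dist, label, weight, nxt in arcs.get(state, []):
                # Acoustic weight is computed on-the-fly and added to the arc.
                new_cost = cost + weight + acoustic_cost(dist, obs)
                if nxt not in expanded or new_cost < expanded[nxt][0]:
                    new_out = outputs + [label] if label else outputs
                    expanded[nxt] = (new_cost, new_out)
        if not expanded:
            return math.inf, []
        best = min(cost for cost, _ in expanded.values())
        # Non-admissible pruning: keep only states within `beam` of the best.
        active = {q: v for q, v in expanded.items() if v[0] <= best + beam}
    candidates = [v for q, v in active.items() if q in finals]
    return min(candidates, key=lambda v: v[0]) if candidates else (math.inf, [])
```

With a narrow beam, a path that is locally expensive in the first frames is discarded even if it would have been the cheapest overall, which is what makes the pruning non-admissible; widening the beam trades speed for accuracy.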


4.3.7. Optimizations 


The characteristics of the recognition transducer $\mathfrak{T}$ were left out of the previous
discussion. They are however key parameters for the design of real-time
large-vocabulary speech recognition systems. The search and decoding speed critically
depends on the size of $\mathfrak{T}$ and its non-determinism. This section describes the
use of the determinization, minimization, and weight pushing algorithms for
constructing and optimizing $\mathfrak{T}$.

The component transducers described can be very large in speech recognition
applications. The weighted automata and transducers we used in the North
American Business News (NAB) dictation task with a vocabulary of just 40,000
words (the full vocabulary in this task contains about 500,000 words) had the
following attributes:


• $\mathfrak{G}$: a shrunk Katz back-off trigram model with about 4M transitions;⁹


• $\mathfrak{P}$: a pronunciation transducer with about 70,000 states and more than
  150,000 transitions;


• $\mathfrak{C}$: a triphonic context-dependency transducer with about 1,500 states and
  80,000 transitions;


• $\mathfrak{H}$: an HMM transducer with more than 7,000 states.


A full construction of $\mathfrak{T}$ by composition of such transducers without any
optimization is not possible even when using very large amounts of memory.
Another problem is the non-determinism of $\mathfrak{T}$. Without prior optimization, $\mathfrak{T}$ is
highly non-deterministic, thus, a large number of paths need to be explored at
the search and decoding time, thereby considerably slowing down recognition.

Weighted determinization and minimization algorithms provide a general 
solution to both the non-determinism and the size problem. To construct an 
optimized recognition transducer, weighted transducer determinization and min- 
imization can be used at each step of the composition of each pair of component 
transducers. The main purpose of the use of determinization is to eliminate 
non-determinism in the resulting transducer, thereby substantially reducing 
recognition time. But, its use at intermediate steps of the construction also 
helps improve the efficiency of composition and reduce the size of the resulting 
transducer. We will see later that it is in fact possible to construct offline the 
recognition transducer and that its size is practical for real-time speech recog- 
nition! 

9 Various shrinking methods can be used to reduce the size of a statistical grammar without
affecting its accuracy excessively.

However, as pointed out earlier, not all weighted automata and transducers
are determinizable, e.g., the transducer $\mathfrak{P} \circ \mathfrak{G}$ mapping phone sequences to words
is in general not determinizable. This is clear in the presence of homophones. But
even in the absence of homophones, $\mathfrak{P} \circ \mathfrak{G}$ may not have the twins property and
be non-determinizable. To make it possible to determinize $\mathfrak{P} \circ \mathfrak{G}$, an auxiliary
phone symbol denoted by $\#_0$ marking the end of the phonemic transcription of
each word can be introduced. Additional auxiliary symbols $\#_1, \ldots, \#_{k-1}$ can be
used when necessary to distinguish homophones as in the following example:


r eh d #_0    read
r eh d #_1    red


At most D auxiliary phones, where D is the maximum degree of homophony,
are introduced. The pronunciation transducer augmented with these auxiliary
symbols is denoted by $\tilde{\mathfrak{P}}$. For consistency, the context-dependency transducer
$\mathfrak{C}$ must also accept all paths containing these new symbols. For further
determinizations at the context-dependent phone level and distribution level, each
auxiliary phone must be mapped to a distinct context-dependent phone. Thus,
self-loops are added at each state of $\mathfrak{C}$ mapping each auxiliary phone to a new
auxiliary context-dependent phone. The augmented context-dependency
transducer is denoted by $\tilde{\mathfrak{C}}$.
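
The assignment of auxiliary symbols to the entries of a pronunciation lexicon can be sketched as follows. The list-of-pairs encoding of the lexicon, the function name, and the spelling `#k` of the auxiliary phones are assumptions made for this example.

```python
from collections import defaultdict

def add_auxiliary_symbols(lexicon):
    """Append an auxiliary phone '#k' to every pronunciation.

    lexicon: list of (word, phones). The k-th entry sharing a given phone
    sequence receives '#k', so every augmented pronunciation is unique.
    """
    seen = defaultdict(int)
    augmented = []
    for word, phones in lexicon:
        key = tuple(phones)
        augmented.append((word, list(phones) + [f"#{seen[key]}"]))
        seen[key] += 1
    return augmented

LEXICON = [
    ("read", ["r", "eh", "d"]),
    ("red", ["r", "eh", "d"]),
    ("data", ["d", "ey", "t", "ax"]),
]
AUGMENTED = add_auxiliary_symbols(LEXICON)
```

The number of distinct auxiliary symbols produced is the maximum number of words sharing one pronunciation, in agreement with the bound D stated above.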

Similarly, each auxiliary context-dependent phone must be mapped to a new
distinct distribution. D self-loops are added at the initial state of $\mathfrak{H}$ with
auxiliary distribution input labels and auxiliary context-dependency output labels
to allow for this mapping. The modified HMM transducer is denoted by $\tilde{\mathfrak{H}}$.

It can be shown that the use of the auxiliary symbols guarantees the
determinizability of the transducer obtained after each composition. Weighted
transducer determinization is used at several steps of the construction. An
n-gram language model $\mathfrak{G}$ is often constructed directly as a deterministic weighted
automaton with a back-off state; in this context, the symbol ε is treated as
a regular symbol for the definition of determinism. If this does not hold, $\mathfrak{G}$ is
first determinized. $\tilde{\mathfrak{P}}$ is then composed with $\mathfrak{G}$ and determinized:
$\det(\tilde{\mathfrak{P}} \circ \mathfrak{G})$.
The benefit of this determinization is the reduction of the number of alternative
transitions at each state to at most the number of distinct phones at that state
(≈ 50), while the original transducer may have as many as V outgoing
transitions at some states where V is the vocabulary size. For large tasks where the
vocabulary size can be more than several hundred thousand, the advantage of
this optimization is clear.

The inverse of the context-dependency transducer might not be deterministic.¹⁰
For example, the inverse of the transducer shown in Figure 4.9 is not
deterministic since the initial state admits several outgoing transitions with the
same input label p or q. To construct a small and efficient integrated transducer,
it is important to first determinize the inverse of $\tilde{\mathfrak{C}}$.¹¹
$\tilde{\mathfrak{C}}$ is then composed with the resulting transducer and determinized. Similarly,
$\tilde{\mathfrak{H}}$ is composed with the context-dependent transducer and determinized.
This last determinization increases sharing among HMM models that start with
the same distributions: at each state of the resulting integrated transducer,
there is at most one outgoing transition labeled with any given distribution
name. This leads to a substantial reduction of the recognition time.

10 The inverse of a transducer is the transducer obtained by swapping input and output
labels of all transitions.

11 Triphonic or more generally n-phonic context-dependency models can also be constructed
directly with a deterministic inverse.

As a final step, the auxiliary distribution symbols of the resulting transducer
are simply replaced by ε's. The corresponding operation is denoted by
$\Pi_\varepsilon$. The sequence of operations just described is summarized by the following
construction formula:


$$\mathfrak{N} = \Pi_\varepsilon(\det(\tilde{\mathfrak{H}} \circ \det(\tilde{\mathfrak{C}} \circ \det(\tilde{\mathfrak{P}} \circ \mathfrak{G})))) \tag{4.3.15}$$


where parentheses indicate the order in which the operations are performed.
Once the recognition transducer has been determinized, its size can be further
reduced by minimization. The auxiliary symbols are left in place, the minimization
algorithm is applied, and then the auxiliary symbols are removed:


$$\mathfrak{N} = \Pi_\varepsilon(\min(\det(\tilde{\mathfrak{H}} \circ \det(\tilde{\mathfrak{C}} \circ \det(\tilde{\mathfrak{P}} \circ \mathfrak{G}))))) \tag{4.3.16}$$


Weighted minimization can also be applied after each determinization step. 
It is particularly beneficial after the first determinization and often leads to 
a substantial size reduction. Weighted minimization can be used in different 
semirings. Both minimization in the tropical semiring and minimization in the 
log semiring can be used in this context. The results of these two minimiza- 
tions have exactly the same number of states and transitions and only differ 
in how weight is distributed along paths. The difference in weights arises from 
differences in the definition of the key pushing operation for different semirings. 

Weight pushing in the log semiring has a very large beneficial impact on 
the pruning efficacy of a standard Viterbi beam search. In contrast, weight 
pushing in the tropical semiring, which is based on lowest weights between 
paths described earlier, produces a transducer that may slow down beam-pruned
Viterbi decoding manyfold.

The use of pushing in the log semiring preserves a desirable property of 
the language model, namely that the weights of the transitions leaving each 
state be normalized as in a probabilistic automaton. Experimental results also 
show that pushing in the log semiring makes pruning more effective. It has 
been conjectured that this is because the acoustic likelihoods and the transducer 
probabilities are then synchronized to obtain the optimal likelihood ratio test for 
deciding whether to prune. It has been further conjectured that this reweighting 
is the best possible for pruning. A proof of these conjectures will require a careful 
mathematical analysis of pruning. 

The result $\mathfrak{N}$ is an integrated recognition transducer that can be constructed
even in very large-vocabulary tasks and leads to a substantial reduction of the
recognition time as shown by our experimental results. Speech recognition is
thus reduced to the simple Viterbi beam search described in the previous section
applied to $\mathfrak{N}$.

In some applications such as spoken-dialog systems, one may wish to
modify the input grammar or language model $\mathfrak{G}$ as the dialog proceeds to exploit
the context information provided by previous interactions. This may be
to activate or deactivate certain parts of the grammar. For example, after a
request for a location, the date sub-grammar can be made inactive to reduce
alternatives.

The offline optimization techniques just described can sometimes be extended
to the cases where the changes to the grammar $\mathfrak{G}$ are pre-defined and
limited. The grammar can then be factored into sub-grammars and an optimized
recognition transducer is created for each. When deeper changes are
expected to be made to the grammar as the dialog proceeds, each component
transducer can still be optimized using determinization and minimization and
the recognition transducer $\mathfrak{N}$ can be constructed on-demand using an on-the-fly
composition. States and transitions of $\mathfrak{N}$ are then expanded as needed for the
recognition of each utterance.

This concludes our presentation of the application of weighted transducer 
algorithms to speech recognition. There are many other applications of these 
algorithms in speech recognition, including their use for the optimization of the 
word or phone lattices output by the recognizer that cannot be covered in this 
short chapter. 

We presented several recent weighted finite-state transducer algorithms and 
described their application to the design of large-vocabulary speech recognition 
systems where weighted transducers of several hundred million states and tran- 
sitions are manipulated. The algorithms described can be used in a variety of 
other natural language processing applications such as information extraction, 
machine translation, or speech synthesis to create efficient and complex sys- 
tems. They can also be applied to other domains such as image processing, 
optical character recognition, or bioinformatics, where similar statistical models 
are adopted. 


Notes 


Much of the theory of weighted automata and transducers and their mathe- 
matical counterparts, rational power series, was developed several decades ago. 
Excellent reference books for that theory are Eilenberg (1974), Salomaa and 
Soittola (1978), Berstel and Reutenauer (1984) and Kuich and Salomaa (1986). 

Some essential weighted transducer algorithms such as those presented in 
this chapter, e.g., composition, determinization, and minimization of weighted 
transducers are more recent and raise new questions, both theoretical and algo- 
rithmic. These algorithms can be viewed as the generalization to the weighted 
case of the composition, determinization, minimization, and pushing algorithms 
described in Chapter 1 Section 1.5. However, this generalization is not always 
straightforward and has required a specific study. 

The algorithm for the composition of weighted finite-state transducers was
given by Pereira and Riley (1997) and Mohri, Pereira, and Riley (1996). The
composition filter described in this chapter can be refined to exploit information
about the composition states, e.g., the finality of a state or whether only
ε-transitions or only non-ε-transitions leave that state, to reduce the number of
non-coaccessible states created by composition.

The generic determinization algorithm for weighted automata over weakly 
left divisible left semirings presented in this chapter as well as the study of 
the determinizability of weighted automata are from Mohri (1997). The deter- 
minization of (unweighted) finite-state transducers can be viewed as a special 
instance of this algorithm. The definition of the twins property was first formu- 
lated for finite-state transducers by Choffrut (see Berstel (1979) for a modern 
presentation of that work). The generalization to the case of weighted automata 
over the tropical semiring is from Mohri (1997). A more general definition for 
a larger class of semirings, including the case of finite-state transducers, as well 
as efficient algorithms for testing the twins property for weighted automata and 
transducers under some general conditions is presented by Allauzen and Mohri 
(2003). 

The weight pushing algorithm and the minimization algorithm for weighted
automata were introduced by Mohri (1997). The general definition of
shortest-distance and that of k-closed semirings and the generic
shortest-distance algorithm mentioned appeared in Mohri (2002). Efficient
implementations of the
weighted automata and transducer algorithms described as well as many oth- 
ers are incorporated in a general software library, AT&T FSM Library, whose 
binary executables are available for download for non-commercial use (Mohri 
et al. (2000)). 

Bahl, Jelinek, and Mercer (1983) gave a clear statistical formulation of speech
recognition. An excellent tutorial on Hidden Markov Models and their application
to speech recognition was presented by Rabiner (1989). The problem of the
estimation of the probability of unseen sequences was originally studied by Good
(1953), who gave a brilliant discussion of the problem and provided a principled
solution. The back-off n-gram statistical modeling is due to Katz (1987). See
Lee (1990) for a study of the benefits of the use of context-dependent models in 
speech recognition. 

The use of weighted finite-state transducer representations and algorithms
in statistical natural language processing was pioneered by Pereira and Riley
(1997) and Mohri (1997). Weighted transducer algorithms, including those
described in this chapter, are now widely used for the design of large-vocabulary
speech recognition systems. A detailed overview of their use in speech
recognition is given by Mohri, Pereira, and Riley (2002). Sproat (1997) and
Allauzen, Mohri, and Riley (2004) describe the use of weighted transducer
algorithms in the design of modern speech synthesis systems. Weighted
transducers are used in a variety of other applications. Their recent use in
image processing is described by Culik II and Kari (1997).


5.0. Introduction


This chapter introduces various mathematical models and combinatorial algo- 
rithms for inferring network expressions that appear repeated in a word or are 
common to a set of words, where by network expression is meant a regular ex- 
pression without Kleene closure on the alphabet of the input word(s). A network 
expression on such an alphabet is therefore any expression built up of concate- 
nation and union operators. For example, the expression A(C + G)T concatenates 
A with the union (C + G) and with T. Inferring network expressions means
discovering such expressions, which are initially unknown. The only input is the
word(s) in which the repeated (or common) expressions are sought. This is
in contrast with another problem, with which we shall not be concerned, that
consists in searching for a known expression in one or more words, both the
expression and the word(s) being in this case part of the input. The inference
of network expressions has many applications,


Version June 23, 2004 


228 Inference of Network Expressions 


notably in molecular biology, system security, text mining, etc. Because of the
richness of the mathematical and algorithmic problems posed by molecular
biology, we concentrate on applications in this area. The network expressions
considered may therefore contain spacers where by spacer is meant any number 
of don’t care symbols (a don’t care is a symbol that matches anything). Con- 
strained spacers are consecutive don’t care symbols whose number ranges over 
a fixed interval of values. Network expressions with don’t care symbols but no 
spacers are called “simple” while network expressions with spacers are called 
“flexible” if the spacers are unconstrained, and “structured” otherwise. Both 
notions are important in molecular biology. Applications to biology motivate 
us also to consider network expressions that appear repeated not exactly but 
approximately. 

Only exact combinatorial methods that are non-trivial (that is, are not sim- 
ple brute-force schemes which enumerate all possible network expressions) will 
be mentioned. In most cases, the network expressions that have been considered 
in the literature present some constraint that generally applies to the union
operator. Indeed, the operands concerned by the union operation will most often
be elements of the alphabet A and not arbitrary words in $A^+$ as is the case with
unrestricted network expressions. For instance, we do not allow expressions such
as A(CG + G)T.

The literature on the inference of regular expressions, also called grammatical 
inference, is vast, and predates computational biology by many years. The 
inference problems addressed in this chapter present special characteristics in 
relation to such general problems. The most important ones are that, although 
the expressions considered here are simpler in the sense indicated above, their 
occurrences are not exact and come hidden inside often very large texts. To use 
the terms commonly adopted in the grammatical inference community, we work 
with (positive) examples that have first to be fished from a sea of other textual
information, consisting mostly of noise. Most often, there is not one regular
expression only, and thus one set of examples, but various distinct ones hidden 
in the same text. 


5.1. Inferring simple network expressions: models and re- 
lated problems 


5.1.1. The star model 


A star expression X is an expression of the form $X = e_1 e_2 \cdots e_m$ where each
$e_i$ is the union of elements of the alphabet, i.e. $e_i = a_{i,1} + a_{i,2} + \cdots + a_{i,n_i}$
with $a_{i,j} \in A$ for $1 \le j \le n_i \le \mathrm{Card}(A)$, $1 \le i \le m$. Star expressions are
therefore words on the alphabet P(A) of all non empty subsets of A, that is,
they are elements of $P(A)^+$. This includes the set A which we denote by $\bullet$ and
call the don’t care symbol¹. Let F(w) denote the set of factors of a word w. A
star expression denotes also a finite set of words of length m. A star expression
$X \in P(A)^+$ is said to occur exactly in a word $w \in A^+$ if there exists $v \in F(w)$
such that $v \in X$. The factor v is said to be an occurrence of X in w.

¹ One also finds in the literature the terms wild card and joker.
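
The definition of exact occurrence can be illustrated by a minimal Python sketch. It assumes a star expression is stored as a list of non-empty sets of letters; the function name `exact_occurrences`, the DNA alphabet and the sample word are assumptions made for this example only.

```python
def exact_occurrences(expr, w):
    """Return the start positions of the factors of w that belong to expr.

    expr is a star expression given as a list of non-empty sets of letters;
    it denotes every word v with len(v) == len(expr) and v[i] in expr[i].
    """
    m = len(expr)
    return [p for p in range(len(w) - m + 1)
            if all(w[p + i] in expr[i] for i in range(m))]


ALPHABET = set("ACGT")   # assumed alphabet for the example
DONT_CARE = ALPHABET     # the don't care symbol is the whole alphabet

# The expression A(C + G)T followed by one don't care symbol.
expr = [{"A"}, {"C", "G"}, {"T"}, DONT_CARE]
print(exact_occurrences(expr, "TACTAAGTGAGT"))
```

Positions start at 0. A window that would run past the end of the word is never tested, so the final AGT of the sample word is not reported.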

The notion of approximate occurrence relies on the notion of a distance 
between two words u and v on A. In biology, like in a number of other text 
applications, a natural distance measures the effort required to transform one 
word into the other given certain allowed operations. The operations that model 
best the mutational events that may happen during replication and survive are 
the substitution (i.e. replacement) of a letter of A by another, the deletion of a 
letter in one of the two words, and the insertion of a letter. Finally, a match is 
an operation that leaves the letter unchanged. These are called edit operations. 

Let S, D, I, M denote the four edit operations described above. An edit
transcript is a string over the alphabet {S, D, I, M} that describes a transformation
of a word u into another v. An equivalent way of describing such a transformation
is through a global alignment of u and v. This is obtained by inserting spaces
in both u and v, transforming them into $u'$ and $v'$ defined over $A \cup \{-\}$ where
$-$ denotes a space and $|u'| = |v'|$. An edit transcript can be easily converted
into a global alignment and vice-versa.

A cost c may be attributed to each operation, where c is a function
$c : (A \cup \{-\}) \times (A \cup \{-\}) \to \mathbb{R}$. The cost of a global alignment (and thus of the
associated edit transcript) of two words $u'$ and $v'$ of length n is then
$\sum_{i} c(u'[i], v'[i])$, the sum ranging over the n positions i. Not
all cost functions define a distance. This will be the case if the function c is
symmetric and if c(a,b) for $a, b \in A \cup \{-\}$ is strictly greater than 0 if $a \ne b$ and
is 0 otherwise.

Two types of distances have attracted special attention. These are the Ham- 
ming and the edit distance (this last is also called the Levenshtein distance). 

In the case of the edit distance, the cost function is

$$c(a,b) = \begin{cases} 1 & \text{if } a \ne b, \\ 0 & \text{otherwise,} \end{cases} \qquad \text{for } a, b \in A \cup \{-\}.$$


The edit distance is thus the minimum number of substitutions, insertions 
and deletions required to transform u into v. The Hamming distance applies
only to words of the same length, and is restricted to substitution/match operations.
The cost function is the same as for the edit distance applied to $a, b \in A$.
It counts the minimum number of substitutions needed to obtain v from u
assuming they have the same length. The Hamming distance will be denoted
by $\mathrm{dist}_H$ and the edit distance by $\mathrm{dist}_E$.
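
Both distances can be computed as in the following Python sketch. It assumes the unit cost function given above; the function names are chosen for this example, and the edit distance uses the classical dynamic programming over prefixes, which is one standard way to compute it and is not detailed in the text.

```python
def hamming(u, v):
    """Number of substitutions needed to turn u into v (same length only)."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs words of the same length")
    return sum(a != b for a, b in zip(u, v))


def edit_distance(u, v):
    """Minimum number of substitutions, insertions and deletions from u to v."""
    # prev[j] is the edit distance between the current prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,               # deletion of a
                           cur[j - 1] + 1,            # insertion of b
                           prev[j - 1] + (a != b)))   # match or substitution
        prev = cur
    return prev[-1]


print(hamming("ACAA", "AAAC"), edit_distance("ACGT", "AGT"))
```

Only two rows of the dynamic programming table are kept, so the space used is proportional to the length of v.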

Given a positive integer d, a d-occurrence of X in a word w is a word
$v \in F(w)$ such that there exists a word $u \in X$ with $\mathrm{dist}(u,v) \le d$.
An expression $X \in P(A)^+$ is thus said to appear approximately in w if it has a
d-occurrence in w. Where there is no possible ambiguity, reference to d will be
dropped in all such notations.
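
Under the Hamming distance, the smallest distance between a factor v and the words denoted by a star expression is simply the number of positions i at which v[i] does not belong to $e_i$, since the closest word u can copy v wherever the expression allows it. The Python sketch below relies on this observation; the list-of-sets representation and the name `d_occurrences` are assumptions of the example.

```python
def d_occurrences(expr, w, d):
    """Start positions of the d-occurrences of a star expression in w.

    expr is a list of non-empty sets of letters. A window of w is a
    d-occurrence when at most d of its letters fall outside the
    corresponding set of expr (Hamming distance to the closest word of expr).
    """
    m = len(expr)
    occ = []
    for p in range(len(w) - m + 1):
        mismatches = sum(w[p + i] not in expr[i] for i in range(m))
        if mismatches <= d:
            occ.append(p)
    return occ


expr = [{"A"}, {"C", "G"}, {"T"}]          # the expression A(C + G)T
print(d_occurrences(expr, "ACTAGTAAT", 0))
print(d_occurrences(expr, "ACTAGTAAT", 1))
```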

The distances considered between words v and u are, as suggested, usually
Hamming or edit, although any other may be used. For ease of exposition, we
consider Hamming distance exclusively. Issues related to the use of the edit
distance instead are left as open problems in Section 5.5.1.

The above definitions of approximate occurrence of an expression lead to the
following inference problem statements. They call upon the concept of quorum.
This is the minimum number of times an expression must appear repeated in
the input word(s). In the case of a set of words, the quorum is the number of
distinct words where the expression appears.


Expression inference problem for a star expression 


Expression X repeated in a word
INPUT: A word $w \in A^+$, a quorum q and a distance d.


OUTPUT: All star expressions $X \in P(A)^+$ that occur d-approximately
in w at least q times.


Expression X common to a set of words
INPUT: N words $w_1, w_2, \ldots, w_N \in A^+$, a quorum q, a distance d.


OUTPUT: All star expressions $X \in P(A)^+$ that occur d-approximately
in at least q of the N words $w_1, w_2, \ldots, w_N$.


Observe that nothing, except algorithmic concerns, would forbid considering,
in all the above definitions and problem statements, network expressions
without constraints on the union operator.

These definitions and statements have been adopted in a number of exact
algorithms, namely COMBI, POIVRE, SPELLER, SMILE, PRATT, and MITRA-COUNT.
The main difference between the algorithms has been in the type of
further constraints put on the network expressions allowed. In COMBI, the
expressions are indeed elements of $P(A)^+$ with just a constraint on the number
of times the don’t care symbol A may be used in the expression. This last
constraint is used by all algorithms. In POIVRE as in PRATT, the expressions are
over an alphabet $\mathcal{S}$ where $\mathcal{S}$ is a proper subset of P(A). In PRATT furthermore,
expressions must appear exactly in a word (d = 0). In SPELLER and MITRA-COUNT,
the expressions are elements of $A^+$. SPELLER was later extended to
handle elements of $\mathcal{S} \subset P(A)^+$ as is the case for SMILE.

The model described in this section is called a star model because the ex- 
pressions X given as output may be viewed as the center of star trees whose 
edges have as length the distance between X and each of its occurrences in the 
input word(s). A special case of the problem of finding such expressions has 
been called the Closest substring problem. This is stated in the following way,
where by $F_k(w)$ we denote the set $F(w) \cap A^k$ of factors of w having length k.


Closest substring problem 


INPUT: N words $w_1, w_2, \ldots, w_N$ over the alphabet A and integers d and
k.


OUTPUT: A word x of length k over A such that there exist $u_i \in F_k(w_i)$
with $\mathrm{dist}_H(x, u_i) \le d$ for all $1 \le i \le N$.

It is interesting to observe the relation between an expression within the star
model and another well known mathematical object: Steiner strings. Given a
set of words $z_1, \ldots, z_N$ of the same length in $A^+$, a Steiner string is a word z also
in $A^+$ which minimizes $\sum_{i=1}^{N} \mathrm{dist}_H(z, z_i)$. One may then wonder whether an
expression X that solves the Expression inference problem for star expressions
leads to a Steiner string, that is, are the star expressions found also solutions of
the Steiner string problem for their sets of occurrences? The answer is negative.
A simple counter-example is the star expression ACAA repeated in the word 
w = AAAAAACAC with d = 1 and q = 4. The expression ACAA of length 4 is indeed 
a solution of the star expression inference problem since it has 4 occurrences in 
w, namely at positions 0,1, 2,5 (positions in a word start at 0). Expression AAAC 
is also a solution with 5 occurrences, at positions 0,1, 2,3,5. Neither expression 
is a Steiner string of its set of occurrences. In both cases, this is the word AAAA. 
Notice that AAAA is not itself a solution of the problem. 
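
The occurrence positions and the Steiner strings quoted in this counter-example can be checked with a small brute-force Python sketch. The function names are chosen for the example, and the candidate Steiner strings are assumed to range over the alphabet {A, C, G, T}.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two words of equal length differ."""
    return sum(a != b for a, b in zip(u, v))


def occurrences(x, w, d):
    """Start positions of the factors of w within Hamming distance d of x."""
    k = len(x)
    return [p for p in range(len(w) - k + 1) if hamming(x, w[p:p + k]) <= d]


def steiner_strings(words, alphabet="ACGT"):
    """All words minimizing the sum of Hamming distances to words (brute force)."""
    k = len(words[0])
    cost = {"".join(t): sum(hamming("".join(t), u) for u in words)
            for t in product(alphabet, repeat=k)}
    best = min(cost.values())
    return [z for z, c in cost.items() if c == best]


w, d = "AAAAAACAC", 1
for x in ("ACAA", "AAAC"):
    pos = occurrences(x, w, d)
    occ = [w[p:p + len(x)] for p in pos]   # occurrences, with multiplicity
    print(x, pos, steiner_strings(occ))
```

The enumeration visits Card(A)^k candidate words, so it is only meant for checking small examples of this kind.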

The star-model admits a variant that is interesting for some biological appli- 
cations. The variant applies to the case of expressions common to a set of words 
and assumes a phylogenetic tree is given as input together with the words. This 
is a binary tree that represents the speciation events that have led to the dif- 
ferent species currently existing (each represented by a word in the input words 
set and associated to a leaf of the tree) from the ancestors that are not known. 
The tree may be unrooted, or rooted if the order of the events in time is known. 
We assume here that the tree is rooted. It is this tree, and not a star-tree, that 
is used to compute the distances. 

In fact, each edge in the tree is labeled by the Hamming distance between 
the words at its extremities. Given a phylogenetic tree, a set of words placed at 
the leaves, and an integer k, factors of length k, one for each input word, and 
intermediate expressions corresponding to internal nodes have to be inferred 
such that the sum of the labels over all edges of the tree is minimized. Such a
minimal sum is called the parsimony score. The expressions are elements of $A^+$.
The problem (in its decision version) as addressed by the FOOTPRINTER algorithm
is as follows.


Substring parsimony problem 


INPUT: N words $w_1, w_2, \ldots, w_N \in A^+$, a phylogenetic tree F for the
words, a length k, and an integer d.


OUTPUT: Factors of length k for the leaves and words of length k for
all internal nodes of the tree that have a parsimony score at most d.
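
To make the objective concrete, the Python sketch below evaluates one candidate labelling of a small tree. The tree shape, the node names and the input words are hypothetical; the sketch only checks a given labelling and does not search for one.

```python
def hamming(u, v):
    """Number of positions at which two words of equal length differ."""
    return sum(a != b for a, b in zip(u, v))


def parsimony_score(edges, label):
    """Sum of the Hamming distances over all (parent, child) edges of the tree."""
    return sum(hamming(label[p], label[c]) for p, c in edges)


def leaves_are_factors(label, words):
    """Check that each leaf is labelled by a factor of its own input word."""
    return all(label[leaf] in w for leaf, w in words.items())


# Hypothetical rooted tree: root -> (anc, s3) and anc -> (s1, s2).
edges = [("root", "anc"), ("root", "s3"), ("anc", "s1"), ("anc", "s2")]
words = {"s1": "TTACGTA", "s2": "GACGAGG", "s3": "CTCGTCC"}
label = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGT",
         "anc": "ACGT", "root": "ACGT"}

d = 2
score = parsimony_score(edges, label)
print(score, leaves_are_factors(label, words) and score <= d)
```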


Another variant that has been considered constrains the expressions given
as solutions to the expression inference problem to satisfy a uniformity property.
Suppose expressions of length k are sought and let X be a solution of the
problem, with v one of its occurrences. Let also two positive integers, $d' < d$
and $k' < k$, be given. The following must then be true: for all i such that
$0 \le i < |X|$, $\mathrm{dist}_H(X[i..i+k'-1], v[i..i+k'-1]) \le d'$. Intuitively, this constraint
imposes that the possible differences between an expression X solution of the
problem and each of its occurrences be uniformly spread, hence the name given
to the property. A further variant has been used in WEEDER where the constraint
that must be satisfied is: for all i such that $0 \le i < |X|$,
$\mathrm{dist}_H(X[0..i], v[0..i])$ must not exceed a bound that depends on i.
The drawback of a constraint of this latter type is that an asymmetry is
introduced: differences may accumulate at the end of the occurrences of an
expression.


5.1.2. The clique model 


In this model, given an alphabet A, the network expressions on A that are
considered have the constraint that union operators may this time be applied to
elements of $A^k$ only, where k is the length of the expressions sought. Such
expressions are therefore collections of words $w \in A^k$, and the notion of occurrence
is associated to both a distance and a quorum. The definition of occurrence in
this case is thus operational and includes within it the problem statement.

Such an alternative model has been used by some authors, most notably in 
the algorithms WINNOWER, MITRA-GRAPH and KMRC. MITRA-GRAPH uses 
in fact both models: the clique model and the star. WINNOWER and MITRA- 
GRAPH formalize the model and problem in graph-theoretical terms. 

We state the problem in the case of a set of words. Given a set of N words
$w_1, w_2, \ldots, w_N$ over the alphabet A and two non negative integers d and k, let
$G = (V_1 \cup \cdots \cup V_N, E)$ be an N-partite graph where $V_i = F(w_i) \cap A^k$ (i.e., it
is the set of all factors of length k of $w_i$, for all i) and there is an edge between
$u_i \in V_i$, $v_j \in V_j$ for $i \ne j$ if $\mathrm{dist}_H(u_i, v_j) \le 2d$. Given d and k as above, and
given a quorum q, we say that a network expression X is a (d,q)-clique if and
only if it is a set of q words of length k such that $(u,v) \in E$ for all $u, v \in X$.

The problem is then formulated as follows. 


Expression inference as a clique problem 


INPUT: N words $w_1, w_2, \ldots, w_N \in A^+$, a quorum q, a length k, and a
distance d; the associated N-partite graph G.


OUTPUT: All (d, q)-cliques of G. 


The output cliques correspond to expressions X having occurrences that
are all pairwise 2d-approximations² of each other in at least q of the N words
$w_1, w_2, \ldots, w_N$. Expression X is the union of its occurrences.
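
The graph formulation can be illustrated by a naive Python sketch that enumerates the (d,q)-cliques directly from the definition. It is a brute-force check on toy inputs, not the method of any of the algorithms named above; the function names and the sample words are assumptions of the example.

```python
from itertools import combinations, product


def hamming(u, v):
    """Number of positions at which two words of equal length differ."""
    return sum(a != b for a, b in zip(u, v))


def factors(w, k):
    """Set of all factors of length k of w."""
    return {w[p:p + k] for p in range(len(w) - k + 1)}


def dq_cliques(words, k, d, q):
    """All sets of q factors, one per distinct input word, pairwise within 2d."""
    parts = [sorted(factors(w, k)) for w in words]
    found = []
    for idx in combinations(range(len(words)), q):          # q distinct words
        for choice in product(*(parts[i] for i in idx)):    # one factor each
            if all(hamming(u, v) <= 2 * d
                   for u, v in combinations(choice, 2)):
                found.append(tuple(zip(idx, choice)))
    return found


cliques = dq_cliques(["ACGT", "ACGA", "TTTT"], k=3, d=1, q=2)
for c in cliques:
    print(c)
```

Each clique is reported as a tuple of (word index, factor) pairs. The enumeration is exponential in q and is only meant to make the definition concrete.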

In POIVRE instead, the graph is built upon a relation R between the letters of
the alphabet A that is also part of the input. Typically, the relation will be non
transitive and model the degree to which shared physico-chemical properties
between the biological units denoted by the letters (nucleic or amino acids)
enable them to perform equivalent functions in a molecule. The relation R on
the letters of A is straightforwardly extended to a relation on words of the
same length in $A^+$ as follows: two factors u, v of length k are in relation by
the extended R if u[i] is in relation with v[i] by R, for $0 \le i \le k-1$. Relation
R extended to factors of length k is denoted by $R_k$. The problem is then
expressed in a way that resembles the formulation given above except that
graph $G = (V_1 \cup \cdots \cup V_N, E)$ is now such that there is an edge between nodes
$u_i \in V_i$, $v_j \in V_j$ for $i \ne j$ if the corresponding factors in $w_i$, $w_j$ are in relation
by $R_k$. In POIVRE, the idea was used to infer contiguous motifs in protein
structures previously coded into a string of pairs of angles whose values are
discretized into integers by means of a grid.

² It will be explained later in this section why a 2d threshold is used instead of simply d.

A natural question is whether for each solution X of the expression inference
as a clique problem in the case of WINNOWER, there exists an expression Y such
that |Y| = |X| and Y is a solution under the star model for the collection of
words that is the set of occurrences of X, distance d and same quorum. The
answer is no. Let us assume expressions in $A^+$ only are considered. Let the
input words be the set $\{w_1 = ACAC, w_2 = AGAG, w_3 = ATAT\}$, d = 1 and q = 3.
The expressions in $A^+$ of length 4, $X_1$ = ACAC, $X_2$ = AGAG and $X_3$ = ATAT are
solutions of the Expression inference as a clique problem as the (non proper)
factors $w_1$ = ACAC, $w_2$ = AGAG, $w_3$ = ATAT in the three input words $w_i$ form
a clique of the 3-partite graph G. Yet no expression Y in $A^+$ exists that is a
solution of the problem as formulated in the star model for the same q and d.
Obviously, there are solutions for distance 2d. In that case however, there may
be more solutions in the star model, some of which have more occurrences.
Consider this time the following set of input words
$\{w_1 = ACAC, w_2 = AGAG, w_3 = ATAT, w_4 = TCTG\}$ and d, q as before.
Expression X = ACAC|AGAG|ATAT remains a solution
within the clique model for quorum 3. Within the star model with distance 2d,
any expression of the type AbAc, with $b, c \in A$, $b \ne c$, is a solution with
three occurrences while expression Y = ACAG is also a solution but with four
occurrences ($w_1, w_2, w_3, w_4$).
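
The comparison between the two models on these input words can be verified exhaustively. The Python sketch below is a brute-force check written for this example (the helper names are assumptions): it tests the pairwise 2d condition of the clique model and counts, for every word Y of length 4 over {A, C, G, T}, the input words in which Y has an occurrence within a given Hamming distance.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two words of equal length differ."""
    return sum(a != b for a, b in zip(u, v))


def star_support(y, words, d):
    """Number of input words having a factor within Hamming distance d of y."""
    k = len(y)
    return sum(
        any(hamming(y, w[p:p + k]) <= d for p in range(len(w) - k + 1))
        for w in words
    )


def is_clique(chosen, d):
    """True when the chosen factors are pairwise within distance 2d."""
    return all(hamming(u, v) <= 2 * d
               for i, u in enumerate(chosen) for v in chosen[i + 1:])


words = ["ACAC", "AGAG", "ATAT"]
d = 1
candidates = ["".join(t) for t in product("ACGT", repeat=4)]

print(is_clique(words, d))                                    # clique model
print(max(star_support(y, words, d) for y in candidates))     # star, distance d
print(star_support("ACAG", words + ["TCTG"], 2 * d))          # star, distance 2d
```

With distance d = 1 no candidate reaches a support of 3, whereas with distance 2d the word ACAG is found in all four words of the extended input.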


5.1.3. Other models 


Other models have been used, in general for inferring expressions that are either
elements of $A^+$ or sets of elements of $A^k$ for a given positive integer k. In the
case of expressions in $A^+$, those sought are the “most surprising” ones in the
statistical sense. They correspond to expressions whose probability of occurring
(exactly or approximately) in the input word(s) is lower than expected assuming
a certain statistical model that is in general a Markov model of order p of
the input word(s) for $p < (k-1)$. The algorithms for computing the “most
surprising” expressions either perform brute-force enumeration of all possible
expressions and then sophisticated exact or approximate statistical evaluation
(described in detail in Chapter 6), or are heuristic (e.g., PROFILE). We therefore
do not speak of this model further.

In the case of expressions that are sets of elements of $A^k$ for a given positive
integer k, it is worth mentioning that another measure has been used to decide
whether a set of factors in some input word(s) should be grouped. The measure
corresponds to what has been called the relative entropy of a set of words or
Kullback-Leibler information number. This measure is not a distance (triangular
inequality is not satisfied) and is global: it is not built upon a pairwise relation
between the factors and therefore it does not lead to a graph-theoretical
formulation as above. Algorithms that seek network expressions that are elements
of $A^+$ can use the relative entropy of the sets of occurrences of the expressions
given as output as a measure of “surprise” that will be different from the
measure given by the probability of occurrence of the expression under the same
conditions as it was inferred.


5.2. Algorithms 


5.2.1. Inference in the star model 
5.2.1.1. Preliminaries: suffix tree and generalized suffix tree 


Long words, especially when they are defined over a small alphabet, may contain
many exact repetitions. From this observation follows the idea of using an
indexed representation of the input word(s). Using a representation that has
a unique pointer for identical factors makes it possible to avoid comparing such
factors more than once with the expressions as these are inferred. The index used by
the algorithm whose description follows is a suffix tree.

Details on the suffix tree construction may be found in Chapter 2. We just recall
below that the suffix tree of an input word w, denoted by $T_w$ or simply T when
the input word is clear from the context, has the following properties:


1. each edge of the suffix tree is labelled by a non empty factor of the input 
word w; 

2. each internal node of the suffix tree has at least two edges leaving it; 

3. the factors labelling distinct edges that leave a same node start with dis- 
tinct letters; 

4. the label of each root-to-leaf path in the suffix tree represents a suffix of 
the input word and the label of each root-to-node path represents a factor; 

5. each suffix of w is associated with the label of a (unique) root-to-leaf path 
in the tree. 


Furthermore, an edge links the node spelling ax to the node spelling x for 
every a € A and x € A*. Such edges are called suffix links and are what allows 
the tree to be built in time linear with respect to the length of the word. 

When the input consists of a set of words $W = \{w_1, w_2, \ldots, w_N\}$, a tree
called generalized suffix tree is used to represent in a compact way all the suffixes
of the set of words. A generalized suffix tree is constructed in a way very similar
to the suffix tree for a single word. We denote such generalized trees by $GT_W$ or
simply GT when the input words are clear from the context. Generalized suffix
trees have properties similar to those of a suffix tree with word w substituted
by the set of words W. In particular, a generalized suffix tree GT satisfies the
fact that every suffix of every word $w_i$ in the set leads to a distinct leaf. When
several words share a suffix, the generalized suffix tree must have as many leaves
corresponding to the suffix, each associated with a different word. To achieve
this property during construction requires simply concatenating to each word
$w_i$ a symbol that is not in A and that is specific to that word.


5.2.1.2. SPELLER algorithm


We describe in this section the original SPELLER algorithm. The algorithm was
later extended to allow for more general network expressions (over $P(A)^+$ and
not just $A^+$) and its performance was improved (see Section 5.3.2.1).

For ease of exposition, the algorithm is first described for inferring
expressions of fixed length that are repeated in a single word. In fact, the length of
the expressions output by the algorithm may range over an interval $(k_{min}, k_{max})$
with possibly $k_{max} = \infty$. In this last case, the longest expressions output will
be those still satisfying the quorum. When $k_{min} = k_{max} = \infty$, only the longest
expressions satisfying the quorum are output. It is relatively straightforward to
modify the algorithm to treat any of these cases, or to infer expressions common
to a set of words, as will be briefly indicated.

SPELLER uses a suffix tree representation of the input word. Actually, it
builds (at the same cost) a suffix tree with additional information attached
to each non-leaf node indicating the number of leaves in the subtree rooted at
that node. This is also the number of occurrences in w of the factor spelled by
the path from the root to the node. Denoting both node and factor by v, the
information added to node v is denoted by $\ell(v)$.
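
The link between leaf counts and numbers of occurrences can be seen on a simplified index. The Python sketch below builds an uncompacted suffix trie rather than a suffix tree, so it takes quadratic time and space and has no suffix links; it is only an illustration, and the class and function names, as well as the end marker `$`, are assumptions of the example.

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> child node
        self.leaves = 0      # number of leaves in the subtree rooted here


def build_suffix_trie(w, end="$"):
    """Insert every suffix of w + end; each suffix ends at its own leaf."""
    root = Node()
    text = w + end
    for s in range(len(text)):
        node = root
        node.leaves += 1
        for c in text[s:]:
            node = node.children.setdefault(c, Node())
            node.leaves += 1   # one more suffix, hence one more leaf, below
    return root


def count_occurrences(root, v):
    """Number of occurrences of v in the indexed word (leaf count of node v)."""
    node = root
    for c in v:
        node = node.children.get(c)
        if node is None:
            return 0
    return node.leaves


root = build_suffix_trie("AAAAAACAC")
print(count_occurrences(root, "AAAA"), count_occurrences(root, "AC"))
```

The end marker plays the role of a symbol outside the alphabet, which guarantees that no suffix is a prefix of another and therefore that every suffix reaches a distinct leaf.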

Candidate expressions in $A^+$ are for convenience processed in lexicographical
order, starting from the empty word $\varepsilon$. For each candidate expression, say x,
all pointers to nodes spelling d-approximate occurrences of x are kept (we say
the nodes themselves are (node-)occurrences of x). Let


$$occ(x, i) = \{ y \in F(w) \mid \mathrm{dist}_H(y, x) = i \}$$


and let


$$occ_x = \bigcup_{i=0}^{d} occ(x, i)$$


be the set of occurrences of x. Possibly, some such sets are empty. Let
$\phi(x) = \sum_{y \in occ_x} \ell(y)$. The candidate expression x is processed as long as
$\phi(x) \ge q$. If x has reached the length k, it is output, otherwise its possible
extensions are considered. Let xa be its first extension (recall that extensions are
attempted in lexicographical order) for $a \in A$. The occurrences of x belonging to
$occ(x,0) \cup occ(x,1) \cup \cdots \cup occ(x,d-1)$ are also occurrences of xa. On the
other hand, among the occurrences of x that belong to the set occ(x,d) (their
Hamming distance from x is already the maximum allowed), only those followed
by a in w may be occurrences of xa. The procedure of extension of a candidate
expression is applied recursively. Clearly, if a given candidate expression x does
not satisfy the quorum anymore, it is useless to extend it.

A pseudo code for the algorithm SPELLER is given below. It assumes the
suffix tree T of w has already been built. We define:


$$occ(x, i)_a = \{ y \in occ(x, i) \mid ya \in F(w) \}$$


(it is the subset of occ(x, i) followed by the letter a).

INITIALIZESPELLER()
1  x ← ε
2  ▷ by convention, the empty word occurs everywhere in w
3  ▷ with Hamming distance 0
4  occ(ε, 0) ← {0, 1, ..., |w| − 1}
5  for i ← 1 to d do
6      occ(ε, i) ← ∅


SPELLER(x, w, q, k, d)
1  if φ(x) ≥ q then
2      if |x| = k then
3          OUTPUT ← OUTPUT ∪ {x, occ_x}
4      else for a in A do
5          for i ← d downto 0 do
6              occ(xa, i) ← occ(x, i)_a
7              if i ≥ 1 then
8                  occ(xa, i) ← occ(x, i)_a ∪ (occ(x, i − 1) \ occ(x, i − 1)_a)
9          SPELLER(xa, w, q, k, d)
10 return OUTPUT


The time complexity of SPELLER is in $O(n V_H(d,k))$ where n is |w| and
$V_H(d,k)$ is the size of the set containing all words of length k at Hamming
distance at most d from another of length k. We have that $V_H(d,k) \le k^d \mathrm{Card}(A)^d$.
Therefore, SPELLER is linear in the input size, but possibly exponential with
respect to d. It has linear space complexity. When d = 0, SPELLER has linear
(optimal) time complexity.
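
The recursive extension step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm as implemented by its authors: occurrences are kept as start positions in w rather than as suffix tree nodes, so repeated factors are not shared, and the quorum counts positions directly. The function name `speller` and the default alphabet are chosen for the example.

```python
def speller(w, q, k, d, alphabet="ACGT"):
    """Words x of length k with at least q d-occurrences in w (Hamming).

    Returns a dict mapping each such x to the sorted start positions of
    its occurrences. Positions start at 0.
    """
    n = len(w)
    output = {}

    def extend(x, occ):
        # occ[i] holds the positions p such that the window of w starting
        # at p, of length len(x), is at Hamming distance exactly i from x.
        if sum(len(s) for s in occ) < q:
            return                      # quorum lost: useless to extend x
        if len(x) == k:
            output[x] = sorted(p for s in occ for p in s)
            return
        m = len(x)
        for a in alphabet:              # extensions in lexicographical order
            new = [set() for _ in range(d + 1)]
            for i in range(d + 1):
                for p in occ[i]:
                    if p + m >= n:
                        continue        # no letter follows this occurrence
                    if w[p + m] == a:
                        new[i].add(p)       # followed by a: same distance
                    elif i < d:
                        new[i + 1].add(p)   # one more substitution
            extend(x + a, new)

    # The empty word occurs everywhere with Hamming distance 0.
    extend("", [set(range(n))] + [set() for _ in range(d)])
    return output


result = speller("AAAAAACAC", q=4, k=4, d=1)
print(result["ACAA"], result["AAAC"])
```

Pruning on the quorum is safe because every occurrence of an extension xa starts at a position where x already occurs, so the count can only decrease as candidates grow.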

When the length of the expressions sought is given as a range of values
$(k_{min}, k_{max})$, the algorithm continues extending candidates as long as they do
not reach $k_{max}$. Any candidate expression x having already reached length $k_{min}$
that satisfies the quorum is output.

In the case where SPELLER is extended to handle expressions in $\mathcal{S} \subset P(A)^+$,
$a \in A$ in line 4 just needs to be replaced by $S \in \mathcal{S}$ while x and xa in lines 6, 8,
and 9 are replaced, respectively, by X and XS.

SPELLER can also be applied to infer expressions in $A^+$ common to a set
of words. As mentioned, a generalized suffix tree GT is used in this case for
representing all the suffixes of the input words. When we are dealing with N
words, it is not enough anymore to know the value of $\ell(v)$ for each node v in
GT in order to be able to check whether an expression satisfies the quorum.
Indeed, for each node v, we need this time to know, not the number of leaves
in the subtree of GT having v as root, but that number for each different word
the leaves refer to.

In order to do that, we associate to each node v in GT a boolean array $bit_v$
of size N, that is defined by:

$$bit_v[i] = \begin{cases} 1 & \text{if at least one leaf in the subtree rooted at } v \text{ represents a suffix of } w_i, \\ 0 & \text{otherwise,} \end{cases}$$

for $1 \le i \le N$.

Let $\phi'(x)$ be the total number of cells set to 1 in the boolean array that
results from the OR of $bit_v$ for all nodes v that are occurrences of x in GT.
This corresponds to the number of distinct input words where x occurs. The
algorithm then changes only in that the condition to be satisfied now is $\phi'(x) \ge q$.

The time complexity in this case is in $O(n N^2 V_H(d,k))$ if n is the length of
each input word (assuming to simplify that they have the same length). The space
complexity is $O(n N^2 k)$.
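
The quorum test on a set of words amounts to an OR followed by a count of the cells set to 1. In the Python sketch below each boolean array is packed into an integer, bit i − 1 standing for word $w_i$; this encoding and the function name `phi_prime` are assumptions of the example.

```python
def phi_prime(bit_arrays):
    """Number of distinct input words covered by a set of node-occurrences.

    bit_arrays contains one integer per node-occurrence; bit i - 1 of an
    integer is set when a leaf below that node represents a suffix of w_i.
    """
    combined = 0
    for b in bit_arrays:
        combined |= b                    # OR of the boolean arrays
    return bin(combined).count("1")      # number of cells set to 1


# Three node-occurrences among N = 4 words: {w1, w2}, {w2} and {w4}.
occurrence_bits = [0b0011, 0b0010, 0b1000]
q = 3
print(phi_prime(occurrence_bits), phi_prime(occurrence_bits) >= q)
```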


5.2.1.3. MITRA-COUNT algorithm 


The MITRA-COUNT algorithm proceeds in exactly the same way as SPELLER 
except that MITRA-COUNT works directly on the input word(s) and not on an 
index of the word(s) in the form of a (generalized) suffix tree. The time and space 
cost of building the suffix tree is thus saved. Another advantage of the approach 
is that the positions of the occurrences of the expressions can be kept naturally 
ordered as the expressions are recursively extended. This is also a characteristic 
of earlier algorithms like COMBI and POIVRE which work in essentially the same 
manner as MITRA-COUNT. On the other hand, the inference step is less efficient 
both in terms of time (if a factor has multiple copies in the input word(s), it will 
be processed as many times as it has copies) and of space (for the same reason, 
factors with multiple copies that are occurrences of an expression will need an 
equal number of pointers to them). 


5.2.1.4. FOOTPRINTER algorithm 


FOOTPRINTER has a completely different approach from SPELLER, or from the
other approaches that will be described in this chapter. It can address only the
problem of inferring expressions common to a set of words. Unique among all
the approaches, it also needs as input a phylogenetic tree besides a set of words.
In a simplified way, a phylogenetic tree, which we denote by $F$, is a tree describing
the speciation events that have led to the species currently observed, or to those
having existed in the past. It is a tree with values attached to the edges whose
topology represents the evolutionary relations between the species (current or
ancestral) and whose nodes correspond to the species. The value of an edge
indicates the evolutionary distance separating the species labelling the nodes at
the edge's extremities. Each possible set of factors, one taken from each input
word, will be considered, that is, placed at the leaves of the input phylogenetic
tree. The parsimony score of the tree is then calculated before deciding whether
the set, and the expression at the root of the tree, are a solution of the substring
parsimony problem. The problem is known to be NP-hard.

Only expressions of a single fixed length k are addressed by FOOTPRINTER. 
Extension to a range of length values is not straightforward: in practice, the 
algorithm has to be run again for each different length required for the output
expressions. 

The algorithm couples a straightforward dynamic programming technique
with the use of a table $tab_v$ containing $Card(A)^k$ entries for each node $v$ of $F$,
including the leaves. All sets of factors are thus treated together. Each entry $x$
(with $x \in A^k$) in the table corresponds to one possible word of length $k$ to be
assigned to node $v$, and contains the value of the best parsimony score that can
be achieved for the subtree rooted at $v$ if node $v$ is labeled with $x$. Denote by
$C(v)$ the set of children of node $v$ in $F$. Then table $tab_v$ can be computed for
all nodes $v$ of $F$, starting from the leaves, by performing the
steps indicated in the algorithm FOOTPRINTER below. The quorum is assumed
to be $N$.


FOOTPRINTER($F, w_1, w_2, \ldots, w_N, k, d$)
1  for all nodes $v \in F$ starting from the leaves do
2      for all $x \in A^k$ do
3          if $v$ is a leaf of $F$ then
4              ▷ let $w_v$ be the input word placed at leaf $v$ in $F$
5              if $x$ is a factor of $w_v$ then
6                  $tab_v[x] \leftarrow 0$
7              else $tab_v[x] \leftarrow +\infty$
8          else $tab_v[x] \leftarrow \sum_{u \in C(v)} \min_{y \in A^k} (tab_u[y] + dist_H(x, y))$
9  return $\{x \in A^k \mid tab_{root}[x] \leq d\}$
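
The table computation can be written as a short runnable Python sketch. It makes several simplifying assumptions that are not part of the text: the tree is given as a dictionary of children lists, edges carry no weights, all $Card(A)^k$ words are enumerated explicitly (so it is only usable for very small $k$), and the names `footprinter`, `children` and `leaf_word` are hypothetical.

```python
from itertools import product
from math import inf
from typing import Dict, List


def hamming(x: str, y: str) -> int:
    """Hamming distance between two words of equal length."""
    return sum(a != b for a, b in zip(x, y))


def footprinter(children: Dict[str, List[str]], leaf_word: Dict[str, str],
                root: str, alphabet: str, k: int, d: int) -> List[str]:
    """Return the words x of length k whose parsimony score at the root is <= d."""
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    tab: Dict[str, Dict[str, float]] = {}

    def fill(v: str) -> None:
        if v in leaf_word:
            # Leaf: score 0 if x is a factor of the word placed there, else +inf.
            w = leaf_word[v]
            tab[v] = {x: (0 if x in w else inf) for x in kmers}
            return
        for u in children[v]:
            fill(u)
        # Internal node: sum over children of the best labelled child score.
        tab[v] = {
            x: sum(min(tab[u][y] + hamming(x, y) for y in kmers)
                   for u in children[v])
            for x in kmers
        }

    fill(root)
    return [x for x in kmers if tab[root][x] <= d]


tree = {"root": ["n1", "L3"], "n1": ["L1", "L2"]}
leaves = {"L1": "ACA", "L2": "ACC", "L3": "GCA"}
result = footprinter(tree, leaves, "root", "ACG", k=2, d=1)
```

The sketch only returns the root labels; recovering the factors at the leaves that produced each solution would require a second, top-down pass over the tables.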


The algorithm has a structure that resembles the structure of the Fitch 
algorithm for the so-called small parsimony problem. It proceeds from the leaves 
up to the root looking for the optimum at each level up, and then, once the 
root has been reached from all leaves, goes down the phylogenetic tree again 
to recover the values at each internal node and leaf that actually produced all 
optimal parsimony solutions that are below d. 

It is possible to use a quorum lower than $N$, giving rise to the so-called
substring parsimony problem with losses. The basic idea is the following. One
assumes the evolution time along the edges of the phylogenetic tree is also known,
and the quorum is in this case expressed as a minimum total evolutionary time
summed over all edges in the subtree containing as leaves the factors that are
occurrences of a same expression. An expression may therefore have fewer than $N$
occurrences, but the occurrences must then span a “wide-enough” evolutionary
time, that is, they must concern organisms that are “distant enough” in terms
of evolution.


5.2.2. Inference as a clique detection problem 


5.2.2.1. WINNOWER algorithm 


The algorithm WINNOWER infers expressions that are collections of
words in $A^k$ for a given positive integer $k$ that is the length of the expression.
As with FOOTPRINTER (and unlike SPELLER or similar algorithms which do not use
an index), there is no efficient way of handling a range of values for the length
of the expressions sought.

The method was elaborated for inferring expressions common to a set of N 
input words but may easily be adapted to find expressions repeated in a single 
input word. It can as easily be modified to handle a quorum lower than N 
although in what follows, the method is presented for a quorum of N only. 

Given $N$ input words $w_1, w_2, \ldots, w_N$ of the same length $n$, an integer $d$ and
a length $k$, WINNOWER starts by building the graph $G = (V_1 \cup \ldots \cup V_N, E)$ as
indicated in Section 5.1.2. The graph has $O(nN)$ nodes and $O(n^2 N)$ edges.

The goal is then to find all cliques of size N in G, which is an NP-complete 
problem. The idea of WINNOWER is to remove edges that cannot belong to 
cliques. This makes the graph sparse enough that clique detection is easier to 
perform. 

This is achieved by incrementally eliminating what are called spurious edges.
An edge is spurious if it does not belong to any extendable clique of a given
size, where by extendable clique of size $c$ is meant a clique contained in all other
possible cliques of size $c + 1$. By observing that every edge belonging to a clique
of size $N$ also belongs to at least $\binom{N-2}{c-2}$ extendable cliques of size $c$, and through a
suitable choice of $c$, it is possible to eliminate spurious edges. This is done
recursively for as long as possible. At the end, one expects the graph will contain only
cliques of size $N$, or that at least detecting cliques of size $N$ will have become
very easy to do in the graph that remains.
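
The general flavour of the graph-and-filter strategy can be conveyed by a small Python sketch. It is not WINNOWER's extendable-clique test, which is not reproduced here; it applies a much weaker necessary condition, namely that a node of an $N$-clique must have a neighbour in each of the other $N - 1$ parts. The edge rule used (two factors of different words are linked when their Hamming distance is at most $2d$) is an assumption about the construction of Section 5.1.2, and the names `build_graph` and `prune` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]  # (word index, start position of a factor of length k)


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def build_graph(words: List[str], k: int, d: int) -> Dict[Node, Set[Node]]:
    """Link factors of different words whose Hamming distance is at most 2d
    (assumed edge rule)."""
    nodes = [(i, p) for i, w in enumerate(words) for p in range(len(w) - k + 1)]
    adj: Dict[Node, Set[Node]] = {v: set() for v in nodes}
    for (i, p), (j, r) in combinations(nodes, 2):
        if i != j and hamming(words[i][p:p + k], words[j][r:r + k]) <= 2 * d:
            adj[(i, p)].add((j, r))
            adj[(j, r)].add((i, p))
    return adj


def prune(adj: Dict[Node, Set[Node]], n_words: int) -> Dict[Node, Set[Node]]:
    """Repeatedly drop nodes lacking a neighbour in some other part."""
    adj = {v: set(ns) for v, ns in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            parts = {u[0] for u in adj[v]}
            if len(parts) < n_words - 1:
                for u in adj.pop(v):
                    adj[u].discard(v)
                changed = True
    return adj


words = ["ACGT", "TACG", "ACGA"]
remaining = prune(build_graph(words, k=3, d=0), n_words=3)
```

Removing a node can make one of its neighbours fail the test in turn, which is why the filter is iterated until nothing changes; the same fixed-point structure underlies the recursive elimination of spurious edges described above.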

The pseudo-code is not given in this case as the core ideas are those described 
above. Care with implementation is required for the efficiency of some essential 
parts of the algorithm but these are not given in enough detail that we feel we 
can reproduce their essence with perfect fidelity. They are therefore omitted. 

The time complexity of WINNOWER is claimed by the authors to be in
$O((nN)^{c+1})$, which is the cost of eliminating spurious edges (for $c = 3$,
eliminating spurious edges takes on average $O(N^4 n^{2.66})$ time according to them). If
$d = 0$, WINNOWER takes exponential time and is therefore, like FOOTPRINTER,
not optimal.

There are interesting instances with critical values of d that cannot be ef- 
ficiently handled by WINNOWER because too few edges can be eliminated and 
the clique detection step must thus be performed in a dense graph. This is the 
case in what the authors called the challenge problem: for instance, for k = 15, 
d = 4, and Card(A) = 4, it is already not feasible to apply WINNOWER to an 
instance as small as 20 words of length 600 each. 


5.2.2.2. MITRA-GRAPH algorithm 


MITRA-GRAPH is an algorithm that mixes the ideas behind MITRA-COUNT (that
is, behind SPELLER) and WINNOWER. It thus works within both the star and
clique model. The solutions produced are those that would be obtained with
MITRA-COUNT for expressions in $A^k$ with a distance of $d$ and, originally, a
quorum of $N$. Extending it to expressions of a length covering a range of values,
or to a quorum less than $N$, is more straightforward and less costly to do than
for WINNOWER.

Like WINNOWER, MITRA-GRAPH builds a graph and looks for cliques of size
$N$ in it. The big difference is that the graph now depends at each step on the
candidate expression currently considered. The graph is thus denoted by $G =
(x, V_1 \cup \ldots \cup V_N, E)$, or $G = (x, V, E)$ for short. The set of nodes of $G$ is defined
as in WINNOWER. It is in the set of edges that the two graphs differ. For each
node $v_i$, set $v_i = p_i s_i$ with $|p_i| = |x|$ (and $|s_i| = k - |x|$). In MITRA-GRAPH, there
is an edge between $v_i$ and $v_j$ if and only if the three inequalities $dist_H(x, p_i) \leq d$,
$dist_H(x, p_j) \leq d$, and $dist_H(x, p_i) + dist_H(x, p_j) + dist_H(s_i, s_j) \leq 2d$ hold. The
condition of existence of an edge is therefore stronger with MITRA-GRAPH than
with WINNOWER. Finding cliques in this graph is also much easier to do than
in the graph used by WINNOWER (it basically consists in eliminating all edges
that enter nodes with degree less than $N - 1$), while the pruning ideas of both
MITRA-COUNT (when an expression does not satisfy the quorum any longer) and
WINNOWER allow this in theory to be a more efficient approach than MITRA-COUNT
alone.
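
The edge condition can be expressed as a small predicate. The sketch below assumes the three inequalities are non-strict, as is usual for a distance threshold; the function name `mitra_edge` is hypothetical.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance between two words of equal length."""
    return sum(a != b for a, b in zip(x, y))


def mitra_edge(x: str, vi: str, vj: str, d: int) -> bool:
    """Edge test between two factors vi, vj of length k for the candidate prefix x.

    Each factor is split as v = p s with |p| = |x|; the prefixes are compared
    with x and the suffixes with each other.
    """
    m = len(x)
    pi, si = vi[:m], vi[m:]
    pj, sj = vj[:m], vj[m:]
    di, dj = hamming(x, pi), hamming(x, pj)
    return di <= d and dj <= d and di + dj + hamming(si, sj) <= 2 * d


# With the empty candidate the test reduces to a plain pairwise distance test.
loose = mitra_edge("", "AGGT", "AGGA", d=1)
# Once the candidate "AC" is fixed, the same pair of factors is no longer linked.
strict = mitra_edge("AC", "AGGT", "AGGA", d=1)
```

The last two calls illustrate why the condition is stronger than a pairwise test: the two factors differ in a single position, but each is already at distance 1 from the prefix `AC`, so the budget of $2d$ is exceeded.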

The algorithm presents an additional cost due to the fact that the graph has
to be updated continuously as viable expressions are recursively explored (in
lexicographic order, as in MITRA-COUNT). The key idea in this case comes from
the observation that once expression $xy$ with $y \in A^+$, $x \in A^*$ has been treated,
either expression $xya$ with $a \in A$ will be considered or, if $xy$ did not satisfy the
quorum and $y[1], \ldots, y[|y| - 1]$ were all equal to the last letter in the alphabet,
it is expression $xb$ with $b \in A$ and different from the first letter in $y$ (it will, in
fact, be the next letter in the alphabet) that will be considered. From the graph
$G(xy, V, E)$, it is easy to obtain $G(xb, V, E)$ if the values of $dist_H(x, v_i[0..(|x|-1)])$,
$dist_H(x, v_j[0..(|x|-1)])$ and $dist_H(v_i[|x|..(k-1)], v_j[|x|..(k-1)])$ are kept for each
edge $(v_i, v_j)$.

As for WINNOWER and for the same reason (not enough detail is presented in 
the literature indicating how the algorithm is actually implemented), a pseudo- 
code for MITRA-GRAPH is omitted. 


5.3. Inferring network expressions with spacers 


5.3.1. Mathematical models and related inference problem 


In biology, network expressions with spacers are a first approach to modelling
sequences along a molecule, typically DNA, that function in a cooperative way,
in the sense that they need to simultaneously bind the same or different molecular
complexes so that a given biological process may be initiated. In the case
of so-called “higher” organisms, the sites may even come in big clusters. The
relative positions along the molecule of the sites that are inside a cluster are in
general not random, either because they are recognized by the same complex
and therefore cannot lie too far apart, or because they are recognized by
different complexes that interact with each other. In this last case, the distances
between sites along the molecule may be longer but are often quite constrained.
Finally, not all positions within a site are equally important for the binding to
happen. In particular, for binding sites in proteins, where recognition is strongly
connected to the 3D structure of the molecule, even a single binding site (single
in the sense that it binds a unique site in the other molecule) may concern a
sequence of non contiguous positions at variable distances from one another that
correspond to amino acids close in 3D space.

Given an alphabet $A$, a network expression $X$ with spacers is an ordered
sequence of simple network expressions $X_1, \ldots, X_p$ with $X_1, \ldots, X_p \in P(A)^+$
and $p \geq 2$.

The expression $X$ is said to appear exactly in a word $w$ if there exist factors
$u_1, \ldots, u_p$ of $w$ such that $w = t_0 u_1 t_1 \cdots t_{p-1} u_p t_p$ with $t_0, \ldots, t_p \in A^*$ and
$u_i \in X_i$ for $1 \leq i \leq p$. Given $d = (d_1, \ldots, d_p)$ non negative integers, $X$ is said
to appear $d$-approximately in $w$ if $w = t_0 u_1 t_1 \cdots t_{p-1} u_p t_p$ with $t_0, \ldots, t_p \in A^*$
and, for all $i \in [1, p]$, there exists $v_i \in X_i$ such that $dist_H(u_i, v_i) \leq d_i$.

Finally, given a network expression $X$ with constrained spacers, that is,
given a sequence of simple network expressions $X_1, \ldots, X_p$, positive integers
$d_1, \ldots, d_p$, and intervals $[min_1, max_1], \ldots, [min_{p-1}, max_{p-1}]$ with $min_i \leq max_i$
non negative integers, $X$ is said to appear approximately in a word $w$ if $w =
t_0 u_1 t_1 \cdots t_{p-1} u_p t_p$ with $t_0, \ldots, t_p \in A^*$, $u_i$ a $d_i$-approximate occurrence of $X_i$
for all $i \in [1, p]$ and $|t_i| \in [min_i, max_i]$ for all $i \in [1, p-1]$. The case of intervals
containing negative values may also be considered but has not been treated in
the literature.
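
As an illustration of the last definition, the following Python sketch tests whether a network expression with constrained spacers appears approximately in a word. The representation is an assumption made for the example: a simple network expression is a sequence of strings, each listing the letters allowed at one position, and the names `approx_starts` and `appears` are hypothetical.

```python
from typing import List, Sequence, Tuple

SimpleExpr = Sequence[str]  # one string of allowed letters per position


def approx_starts(w: str, X: SimpleExpr, d: int) -> List[int]:
    """Start positions of factors of w at Hamming distance <= d from some word of X."""
    m = len(X)
    return [p for p in range(len(w) - m + 1)
            if sum(w[p + j] not in X[j] for j in range(m)) <= d]


def appears(w: str, exprs: List[SimpleExpr], dists: List[int],
            spacers: List[Tuple[int, int]]) -> bool:
    """True if X_1 [min_1, max_1] X_2 ... X_p appears approximately in w."""
    # End positions (exclusive) of approximate occurrences of X_1.
    ends = {p + len(exprs[0]) for p in approx_starts(w, exprs[0], dists[0])}
    for X, d, (lo, hi) in zip(exprs[1:], dists[1:], spacers):
        starts = approx_starts(w, X, d)
        # Keep occurrences of X_i that start lo..hi positions after a valid end.
        ends = {s + len(X) for s in starts
                if any(s - g in ends for g in range(lo, hi + 1))}
    return bool(ends)


w = "TTGACACCGGTATAAT"
X1 = ["T", "T", "G", "A", "C", "A"]
X2 = ["T", "A", "T", "A", "AT", "T"]
found = appears(w, [X1, X2], [0, 0], [(3, 5)])
```

In the example, `X1` ends at position 6 and `X2` starts at position 10, so the spacer has length 4 and the expression is found for the interval $[3, 5]$ but not for $[5, 7]$.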

From now on, network expressions $X$ with constrained spacers and composed
of $p$ simple network expressions $X_1, \ldots, X_p$ separated by distances within
the intervals $[min_1, max_1], \ldots, [min_{p-1}, max_{p-1}]$ will be denoted by $X = X_1
[min_1, max_1] X_2 \ldots X_{p-1} [min_{p-1}, max_{p-1}] X_p$. Network expressions with
unconstrained spacers and composed of $p$ simple network expressions $X_1, \ldots, X_p$
will be denoted by $X = X_1 * \ldots * X_p$.


5.3.2. Algorithms 
5.3.2.1. Inferring network expressions with constrained spacers 


MITRA-DYAD algorithm. MITRA-DYAD infers network expressions with constrained
spacers for the case $p = 2$ only, that is, expressions of the type $X =
X_1[min, max]X_2$. The reason is that the inference is performed in a graph like the
one used by MITRA-GRAPH but containing $O(max - min + 1)$ times more nodes and
potentially $O((max - min + 1)^2)$ times more edges. Indeed, supposing $|X_1| = |X_2| = k$,
each factor $u$ of length $k$ of the input words $w_1, \ldots, w_N$, which corresponded
to a node in the original graph, now gives rise to $O(max - min + 1)$ nodes, each
one corresponding to the factor $u$ followed by the factor $v$ starting $i$ positions
after the end of $u$, when such a position exists, for $i$ between $min$ and $max$.
Nodes are then linked under the same conditions as for MITRA-GRAPH; in
particular, the existence of an edge between two nodes remains dependent on the
expression $X = X_1[min, max]X_2$ that is being currently considered. Once this
graph is built, MITRA-DYAD runs MITRA-GRAPH on it. The way the graph is built
ensures that the solutions found in this
way correspond to the required network expressions with constrained spacers.
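
The node multiplication described for MITRA-DYAD can be made concrete with a short sketch that enumerates the dyad nodes of a single word. It assumes both halves have the same length `k`; the function name `dyad_nodes` and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def dyad_nodes(w: str, k: int, lo: int, hi: int) -> List[Tuple[int, int, str, str]]:
    """Enumerate nodes (start of u, gap, u, v): a factor u of length k followed,
    after a gap of lo..hi positions, by a factor v of length k."""
    nodes = []
    for p in range(len(w) - k + 1):
        for g in range(lo, hi + 1):
            s = p + k + g              # start of the second factor
            if s + k <= len(w):        # the second factor must fit in w
                nodes.append((p, g, w[p:p + k], w[s:s + k]))
    return nodes


nodes = dyad_nodes("ACGTACGT", k=2, lo=1, hi=2)
```

Each factor yields at most `hi - lo + 1` nodes, fewer near the end of the word where the second factor no longer fits.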


SMILE algorithm. There are in fact two versions of the SMILE algorithm.
Both versions call the SPELLER algorithm given in Section 5.2.1 as an internal
subroutine. The basic algorithm for a single input word is shown below.
It is straightforward to adapt it to the case of $N$ input words. For
ease of exposition, we assume that, in the expression $X = X_1[min_1, max_1]X_2
\ldots X_{p-1}[min_{p-1}, max_{p-1}]X_p$, all expressions $X_i$ have the same length $k$ and
maximum number of differences allowed $d$. The way the search space is
considered is what makes the main difference between the two versions of SMILE.
It is worth observing that besides being able to handle a different distance $d$
for each simple expression in $X$, SMILE can handle a global distance, something
MITRA-DYAD cannot.

The SMILE algorithm shown below assumes that $p = 2$ and that the suffix
tree $T$ of $w$ has been previously built. The notations $X_1$ and $X_2$ stand for
candidate motifs for, respectively, the first and second expressions in the network
expression with spacer that is being searched for.


SMILE($w, q, k, d, (min_1, max_1), \ldots, (min_{p-1}, max_{p-1})$)
1  for $i \leftarrow 1$ to $p - 1$ do
2      for each solution of SPELLER($X_1, w, q, k, d$) do
3          consider only the search space of all factors of $w$
4          that start from $min_i$ to $max_i$ positions after
5          occurrences of $X_1$ in $w$
6  return SPELLER($X_2, w, q, k, d$)


The extension of SMILE to the case where $p > 2$ is straightforward. The
difference between the two versions of the SMILE algorithm is in how lines
3 to 5 are dealt with. We explain it in the simple case where $p = 2$ and
$|X_1| = |X_2| = k$. Generalization to different lengths (or range of lengths)
for each simple expression in $X$, or to a general $p$, is straightforward for the
first version and more elaborate for the second. Details may be found in the
literature indicated in the notes at the end of the chapter.

The first version of SMILE proceeds as follows. For each expression $X_1$ of
length $k$ satisfying the quorum that is obtained, together with its set of
node-occurrences in $T$, which we denote by $occ_{X_1}$, all simple expressions $X_2$ are sought.
The search starts (using SPELLER) with the expression $X_2 = \epsilon$ and $occ_{X_2}$ the set
of nodes $v$ which have an ancestor $u$ in $occ_{X_1}$ with $min \leq level(v) - level(u) \leq
max$, where $level(v)$ indicates the length of the label of the path from the root
to node $v$ in $T$. From a node-occurrence $u$ in $occ_{X_1}$, a jump is therefore made
in $T$ to all potential start node-occurrences $v$ of $X_2$. These nodes are the $min$-
to $max$-generation descendants of $u$ in $T$.

The second version of SMILE initially proceeds like the first. For each simple
expression $X_1$ inferred, and for each node-occurrence $u$ of $X_1$ considered in
turn, a jump is made in $T$ down to the descendants of $u$ located at lower levels.
This time however, the algorithm just passes through the nodes at these lower
levels, grabs some information the nodes contain and jumps back up to level $k$
again. The information grabbed in passing is used to temporarily and partially
modify $T$ and start, from the root of $T$, the inference of all possible companions
$X_2$ for $X_1$ that are located at the required distance $(min, max)$. Once this
operation has ended, the part of $T$ that was modified is restored to its previous
state. The inference of another simple expression $X_1$ then follows. The whole
process unwinds in a recursive way until all expressions $X$ satisfying the initial
conditions are inferred.

More precisely, the operation between the spelling of $X_1$ and $X_2$ locally
alters $T$ up to level $k$ into a tree $T'$ that contains only the prefixes of length
$k$ of suffixes of $w$ starting at a position between $min$ and $max$ from the end
position in $w$ of an occurrence of $X_1$. Tree $T'$ is, in a sense, the union of all the
subtrees $t$ of depth at most $k$ rooted at nodes that represent start occurrences
of a potential companion $X_2$ for $X_1$. SPELLER can then be applied directly to
$T'$. The information that is grabbed in passing is the one required to modify
$T$ into $T'$: it corresponds to the boolean arrays indicating to which factors of
$w$ belong the leaves of all potential end node-occurrences of companions for $X_1$
in the tree.

The complexity of the first version of SMILE for a single input word is $O(n +
n_{2k+max} V_H(d,k))$ where $n_{2k+max}$ is the number of nodes at level $2k + max$ in
the suffix tree. Its space complexity is $O(n(2k + max))$.

The complexity of the second version of SMILE for a single input word is
$O(n + \min\{n^2, n_{2k+max}\} V_H(d,k) + n_{2k+max} V_H(d,k))$ and its space complexity
$O(n(2k + max) + n_k)$.


5.3.2.2. Inferring network expressions with unconstrained spacers 


SMILE algorithm revisited. Extensions of the SMILE algorithm also make it
possible to deal with flexible spacers.

The first extension concerns what is called “meta-differences”. Given a
non negative integer $D$, a network expression $X = X_1[min_1, max_1]X_2 \ldots X_{p-1}
[min_{p-1}, max_{p-1}]X_p$ is said to appear exactly in a word $w$ if there exist factors
$u_{j_1}, \ldots, u_{j_q}$ of $w$ such that $p - D \leq q \leq p$ and $w = t_0 u_{j_1} t_1 \cdots t_{q-1} u_{j_q} t_q$, with
$t_0, \ldots, t_q \in A^*$, $1 \leq j_1 < \cdots < j_q \leq p$ and $u_{j_i} \in X_{j_i}$ for $i = 1, \ldots, q$. An
equivalent definition may be derived for approximate occurrences of $X$.

The second extension allows SMILE to handle restricted intervals of distances
between the simple network expressions $X_1, \ldots, X_p$, exploring in a same run a
wide range of possibilities for the middle value of the interval. The expressions
that may be inferred in this case are of the type $X = X_1[m_1 \pm \epsilon_1]X_2
\ldots X_{p-1}[m_{p-1} \pm \epsilon_{p-1}]X_p$ where, for $1 \leq i < p$, $\epsilon_i$ is a non negative integer and
$m_i \in [Min_i, Max_i]$ with $Max_i - Min_i$ as large as desired.


PRATT algorithm. Trying to infer network expressions with completely
unconstrained spacers would in most situations lead to trivial solutions, besides
being a computationally harder problem. PRATT therefore imposes some
constraints on the number and distribution of don't care symbols that are allowed.
The expressions treated may however be more flexible than the constrained
spacers of SMILE as presented in Section 5.3.2.1. We shall see in a moment the
extensions of SMILE which make it possible to deal with spacers that are as
unconstrained as in PRATT, although in a different way.

The constraints that PRATT puts on the spacers are specified as input pa- 
rameters. Some of the main ones are: 


1. a maximum number of spacer regions, that is of regions that are composed 
of a contiguous sequence of don’t care symbols; 


2. a maximum length for spacer regions; 
3. a maximum overall number of don't care symbols in the expressions sought;


4. a maximum length of the network expression. 


Other possible constraints are omitted for the sake of simplicity. 

PRATT takes in general as input $N$ words, that is, it infers common network
expressions, but it can easily be modified to treat the case of a single input word. It
works basically like SPELLER for expressions in $S \subseteq P(A)^*$. Unlike SPELLER,
PRATT does not use a suffix tree representation of the input word(s) but a simple
queue or file data structure like COMBI or POIVRE. The don't care symbol
is treated in a way similar to another element of $S$, with counters that make it
possible to check whatever spacer constraint was given as input.

The version of SMILE that allows intervals for the distances between single 
network expressions results in performances analogous to those of PRATT as far 
as spacers are concerned, although in a slightly different way. SMILE can be 
more flexible and it further allows for differences in the inference process. 


5.4. Related issues 


5.4.1. The concept of basis 


Given some input word(s), the number of even simpler expressions $X \in A(A \cup
\{\bullet\})^* A$ can be exponential in the length of the input, so that it is infeasible to
list all of them along with their occurrences in the word(s). Fixing the Hamming
distance to 0 or using a high quorum does not avoid the explosive growth in the
number of such expressions. Several researchers are working to alleviate this
drawback.

Among the many methods proposed to select expressions, one can single
out those based on the notion of maximality or specificity. We assume $d = 0$.
Since the expressions may contain don't cares, approximate occurrences are in a
certain sense still allowed. Informally, an expression $X$ in $A(A \cup \{\bullet\})^* A$ is maximal
if it cannot be extended to the left or to the right by adding further symbols
and/or if none of its don't care symbols can be replaced by an alphabet letter,
without losing any occurrences. In other words, making a maximal expression
more specific causes a loss of information, while this is not true for non-maximal
expressions. While the notion of maximality significantly reduces the number
of expressions, their number may still be exponential.

A significant step in reducing the number of maximal expressions is the
introduction of the notion of basis. Informally speaking, a basis is a set of
(maximal) expressions that can generate all the (other) maximal expressions by
simple mechanical rules. The maximal expressions in the basis are representative
of the information content of the words in that they can generate all the other
repeated or common expressions. A notion of basis called the set of tiling
motifs was introduced. It has size linear in the length $n$ of the input word(s)
and it is able to generate the repeated expressions (possibly exponential in
number) that appear at least twice with don't care symbols in such input over an
alphabet $A$. This basis has some interesting features such as being (a) a subset
of previously defined bases; (b) truly linear, as its expressions are fewer than $n$ in
number and appear in the word for a total of $2n$ times at most; (c) symmetric,
as the basis of the reversed word is the reverse of the basis; (d) computable in
polynomial time, namely, in $O(n^2 \log n \log Card(A))$ time. As an example, the
basis of tiling motifs for repeats in the word $w$ = ATATACTATCAT contains three
elements, namely $x_1$ = ATA•••TAT, $x_2$ = ATAT••T, and $x_3$ = TATA••AT, that are
able to generate (through a suitable operation that takes also into account the
positions where the motifs in the basis occur) all other repeated motifs such as
TAT, TA, AT, ATA•••T, etc. that appear at least twice in $w$. For instance, the motif
ATA•••T can be obtained by the overlap of the occurrences of $x_1$ and of $x_2$ at
position 0.
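
The statement that the derived motifs appear at least twice in the example word can be checked with a few lines of Python. The sketch below only locates exact occurrences of a motif with don't care symbols; it does not compute a basis. The don't care symbol is written `.` in the code, and the function name `occurrences` is hypothetical.

```python
from typing import List


def occurrences(w: str, motif: str, dont_care: str = ".") -> List[int]:
    """Start positions in w where the motif matches, a don't care matching any letter."""
    m = len(motif)
    return [p for p in range(len(w) - m + 1)
            if all(c == dont_care or c == w[p + j] for j, c in enumerate(motif))]


w = "ATATACTATCAT"
repeated = {m: occurrences(w, m) for m in ["TAT", "TA", "AT", "ATA...T"]}
```

For the four motifs listed, the occurrence lists are `[1, 6]`, `[1, 3, 6]`, `[0, 2, 7, 10]` and `[0, 2]` respectively, so each of them does occur at least twice in the word.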

A more general and flexible framework is required for repeated or common 
expressions when d > 0 and the notion of a basis may perhaps not be extended 
in this case. Some fuzzy form of clustering should then be considered. 


5.4.2. Inferring tandem network expressions 
5.4.2.1. Problem definition 


Tandem arrays (called tandem repeats when there are only two units) are ap- 
proximate powers (squares) of a word, that is, a sequence of approximate repeats 
that appear adjacent in a word. The inference of tandem arrays may proceed 
in much the same way as for simple expressions appearing repeated a number 
of times in a word (using for instance SPELLER or MITRA-COUNT). Checking 
that the expression appears tandemly repeated can then be done a posteriori. 
This however can be a very inefficient approach, as many expressions will be
generated whose occurrences have no chance of forming a tandem array. It is
therefore more interesting to develop a method that checks the tandem
condition of a repeat as it proceeds with the inference, that is, simultaneously
with it. The use of a suffix tree is not appropriate when approximate matches
are sought because a suffix tree does not allow the positions of the occurrences
to be kept ordered for easy processing of the tandem condition. An approach
like the one adopted by MITRA-COUNT, that was also used earlier in COMBI or
POIVRE, is the most appropriate in this case.
Before sketching the main ideas of the algorithm, called SATELLITE, we need
to introduce the more complex models required by tandem arrays. There are in 
fact two definitions related to a tandem array model, one called prefix model and 
the other consensus model. This latter concerns tandem array models strictly 
speaking while prefix models are in fact models for approximately periodic re- 
peats that are not necessarily (yet) tandem. They correspond to the prefixes of 
a consensus model. 

Formally, a prefix model of a tandem array is a word $x \in A^+$ ($x$ could also
belong to $P(A)^*$) that approximately matches a train of wagons. A wagon of $x$
is a factor $u$ in $w$ such that $dist_E(x, u) \leq d$ for $d$ a non negative integer (observe
that in this case, it is the edit distance that has been considered). A train of a
prefix model $x$ is a collection of wagons $u_1, u_2, \ldots, u_p$ ordered by their starting
positions in $w$ and satisfying the following properties:

$(P_1)$ $p \geq q$ where $q$ is again a quorum indicating this time the minimum
number of units the sought tandem arrays must have;

$(P_2)$ $left_{u_{i+1}} - left_{u_i} \in [min\_period, max\_period]$ where $left_u$ is the position
of the left end of wagon $u$ in $w$ and min_period, max_period are the minimum
and maximum period of the repeat.

A consensus model must further satisfy the following property:

$(P_3)$ $left_{u_{i+1}} - right_{u_i} = 0$
where $right_u$ is the position of the right-end of wagon $u$. The property checks
that the occurrences of consensus models are indeed tandem. This is verified
only when $|x| \in [min\_period, max\_period]$, that is when the length of the repeat
has reached the value specified as input.

The tandem array inference problem is then the following. 


Inference of tandem array problem 


INPUT: A word $w \in A^*$, a quorum $q$, an edit distance $d$ and a minimum
and maximum period min_period and max_period.


OUTPUT: All expressions $x \in A^+$ that are consensus models for tandem
arrays (that is, properties $(P_1)$, $(P_2)$ and $(P_3)$ are satisfied).


5.4.2.2. SATELLITE algorithm 


Expressions for tandem arrays are inferred by increasing length. The algorithm
keeps track of individual wagons, and at each step determines, on the fly, if
they can be combined into at least one train (observe that a wagon can belong
to more than one train). The latter corresponds to checking, for each wagon,
whether it belongs to at least one set of wagons satisfying properties $(P_1)$ and
$(P_2)$/$(P_3)$ above.

For each expression $x$ that is a prefix model for a tandem array, a list of the
wagons of $x$ that belong to at least one train of $x$ is kept. When the expression
$x$ is extended into the expression $x' = xa$, two tasks must be performed:


1. determine which wagons of $x$ can be extended to become wagons of $x'$;
2. among these newly determined wagons of $x'$, keep only those that belong
to at least one train of $x'$. This requires effectively assembling wagons into
trains.


The trains do not need to be enumerated in the second step; it must only be
determined whether a wagon is part of one. This allows an extension step to be
performed in time linear in the length of the input word.

Consider the directed graph $G = (V, E)$ where $V$ is the set of all wagons
of $x$ and there is an edge from wagon $u$ to wagon $v$ if $left_v - left_u \in
[min\_period, max\_period]$. A wagon $u$ is then part of a train if it is in a path of
length $q$ or more in $G$. Determining this is quite simple as the graph is clearly
acyclic.
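
Deciding which wagons lie on a long enough path can be sketched with two longest-path passes over the acyclic graph. The sketch makes assumptions that are not in the text: wagons are reduced to their left-end positions, min_period is at least 1, the length of a path is counted in wagons, and the two passes are written as simple quadratic loops rather than the linear-time procedure mentioned above. The name `wagons_in_trains` is hypothetical.

```python
from typing import List


def wagons_in_trains(lefts: List[int], q: int,
                     min_period: int, max_period: int) -> List[int]:
    """Left-end positions of the wagons lying on a chain of at least q wagons whose
    consecutive left ends differ by a value in [min_period, max_period]."""
    pos = sorted(set(lefts))
    n = len(pos)
    back = [1] * n  # longest chain (in wagons) ending at pos[i]
    for i in range(n):
        for j in range(i):
            if min_period <= pos[i] - pos[j] <= max_period:
                back[i] = max(back[i], back[j] + 1)
    fwd = [1] * n   # longest chain (in wagons) starting at pos[i]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            if min_period <= pos[j] - pos[i] <= max_period:
                fwd[i] = max(fwd[i], fwd[j] + 1)
    # A wagon is kept if the longest chain through it has at least q wagons.
    return [pos[i] for i in range(n) if back[i] + fwd[i] - 1 >= q]


kept = wagons_in_trains([0, 3, 6, 9, 20], q=3, min_period=3, max_period=4)
```

The trains themselves are never enumerated: only the two chain lengths per wagon are stored, which is what makes the membership test cheap.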

For each expression $x$ of length between min_period and max_period, it
remains to check whether $x$ satisfies the properties of a consensus model for
a tandem array. Consider now a directed bipartite graph $G_x = (L_x \cup R_x, E)$
whose vertices are the positions at which, respectively, the left- and right-ends of
wagons of $x$ occur. Edges $i \to j$ with $i \in L_x, j \in R_x$ are wagon edges and edges
$j \to i$ with $i \in L_x, j \in R_x$ are gap edges. There is a wagon edge $i \to j$ if and
only if $w[i..j-1]$ is a wagon, and there is a gap edge $j \to i$ if and only if $i = j$.
Thus, an edge sequence $i \to j \to k$ occurs in $G_x$ if and only if there are
wagons $u$ and $v$ such that $u = w_i w_{i+1} \cdots w_{j-1}$, $left_v = k$, and $left_v - right_u = 0$.
It follows that a position/node which is on a path of length $2q$ or more is part
of a train satisfying properties $(P_1)$, $(P_2)$ and $(P_3)$. Such a position is called a
final position or final node. Let $G'_x$ be the graph induced by the set, $F_x$, of all
final nodes. If $G'_x$ is non-empty, then $x$ is a consensus model for a tandem array
having the characteristics specified in the input.

The complexity of SATELLITE is $O(n \cdot max\_period \cdot M_E(d,k))$ where $n$ is, as
before, $|w|$ and $M_E(d,k)$ is the size of the set containing all words of length $k$ at
edit distance $d$ from another word of length $k$. This is actually the complexity
of SPELLER multiplied by the term max_period because of the need to check for
the tandem condition. An extended version of SATELLITE can deal with
tandem arrays that may miss a period, meaning that the repeat may contain
some units that have accumulated more differences than allowed. Such units
are called bad wagons. A number of them may be authorized in a train.


5.5. Open problems 


5.5.1. Inference of network expressions using edit distance 


In theory, all algorithms presented in this chapter may be modified to handle 
edit instead of Hamming distance. Indeed, edit distance is already an integral 
part of POIVRE (and of SATELLITE). Thus MITRA-COUNT which behaves much 
as POIVRE can easily be extended to use an edit distance. The same is true of 
SPELLER and such a modification was suggested and quickly sketched by the 
authors. A more recent approach using a suffix tree like SPELLER introduces 
what appears to be an algorithm producing a different solution from SPELLER 
given the same instance.

There has been also a theoretical discussion on how to introduce edit dis- 
tance into FOOTPRINTER which includes the time complexity that the resulting 
algorithm would have. However, FOOTPRINTER is not suitable for dealing with 
expressions or occurrences of variable length which appear when insertions and 
deletions are allowed. The reason comes from the data structure used (the table 
at each node of the tree). 

WINNOWER could also theoretically handle an edit distance, but the num- 
ber of edges in the graph would grow as would the number of spurious edges. 
MITRA-GRAPH would have the same type of problem but the filtering of spu- 
rious edges is easier to perform and therefore the algorithm might be able to 
handle the situation a lot better than WINNOWER. 

Finally, the first version of SMILE is, like SPELLER, easily modifiable to
handle an edit distance. Although theoretically not impossible, introducing
such a distance into the second version of SMILE might be trickier.

In all cases, it is worth exploring more compact ways of representing the 
occurrences of an expression once insertions and deletions are allowed. One 
possible way extends ideas for pattern matching in a long text with edit distance. 


5.5.2. Minimal covering set 


The concept of a minimal covering set of expressions may make it possible to
address two difficulties encountered by existing combinatorial algorithms for
network expression inference in a set of words. These difficulties are, first, how
to fix the quorum a priori and, second (an even harder problem), how to efficiently
identify weak and rare expressions. To solve the second problem one can
increase the value of d while simultaneously decreasing the value of q. However,
this may lead to a huge number of solutions, many of which are uninteresting.
A minimal covering set extends the concept of individual expressions, with or
without spacers, to that of a family of expressions which “completely explains”
a set of words. More precisely, the problem could be expressed in the
following (informal) way. Given a set of input words, one must find a minimal
set of r ≥ 1 expression(s) (the value of r is unknown at the start) such that:


• each expression has an occurrence in at least one input word;

• distinct expressions among the r may have occurrences in the same input
word but the number of times this may happen is smaller than a threshold
value t (possibly t may be 0: there is no “word overlap” of the expressions);

• all words are covered by (at least) one expression in the family (strictly
one in case the threshold t is 0).


Notes


The term network expression to denote repeated regular expressions without 
the Kleene closure was introduced for the first time in (Mehldau and Myers 
1993). 

The literature on grammatical inference is large. Three papers have been 
influential on the theory of learning grammars. The first by Gold (Gold 1967) 
introduced the notion of “language identification in the limit”, the second by 
Wharton (Wharton 1974) relaxed the condition of an exact identification by al- 
lowing for various descriptions of the correct solution, while the third by Valiant 
(Valiant 1984) relaxed this condition by allowing a solution to be only
approximately correct. The earliest and main exposition of inference problems
for regular grammars is due to Angluin (Angluin 1982, 1987).

The definitions and statements concerning the star model have been adopted 
in several exact algorithms, namely COMBI (Sagot and Viari 1996), POIVRE 
(Sagot, Viari, and Soldano 1997), SPELLER (Sagot 1998), SMILE (Marsan and 
Sagot 2000b, 2001), PRATT (Jonassen, Collins, and Higgins 1995), and
MITRA-COUNT (Eskin and Pevzner 2002). The uniform property variant of the star
model has been introduced in (Sagot 1996) and a similar idea in a previous 
paper (Sagot, Soldano, and Viari 1995). 

The closest substring problem has been defined and proved to be NP-com- 
plete in (Fellows, Gramm, and Niedermeier 2002). It remains an open problem 
whether it is parameter-tractable for constant size alphabet when either d alone 
or d and N are fixed (Fellows et al. 2002). 

The definitions of the substring parsimony problem and the proof of its NP- 
hardness are given in (Blanchette, Schwikowski, and Tompa 2000). The small 
parsimony problem was introduced in (Fitch 1975) and the substring parsimony 
problem with losses in (Blanchette 2001, Blanchette, Schwikowski, and Tompa 
2002). The FOOTPRINTER algorithm was presented and analyzed in (Blanchette 
et al. 2000, Blanchette 2001, Blanchette and Tompa 2002, Blanchette et al. 
2002). The time complexity of FOOTPRINTER is $O(Nk\,\mathrm{Card}(A)^{k} + nNk)$ where
n is the length of each input word (assuming they have the same length). The
highest term was in fact $Nk\,\mathrm{Card}(A)^{2k}$ in a first paper (Blanchette et al. 2000).
This came from the fact that the computation of the Hamming distance between
two words of length k (which takes O(k) time) is done for each of the O(N) edges
in F, each of the $O(\mathrm{Card}(A)^{k})$ possible values for x and each of the
$O(\mathrm{Card}(A)^{k})$ possible values for y. In (Blanchette 2001), an improvement of the
original algorithm described in (Blanchette et al. 2000) was introduced which made
it possible to get the exponent k instead of 2k. The improvement is achieved by means of
an auxiliary table for each edge in F. Details may be found in (Blanchette 2001). 
FOOTPRINTER is thus linear with the size of the input words but exponential 
with the length k of the expressions sought. If d = 0, FOOTPRINTER still takes 
exponential time and is therefore not optimal. The algorithm WINNOWER was 
described in (Pevzner and Sze 2000). One must observe that if a quorum lower 
than N is used, the size of the cliques sought is the only thing that changes 
in WINNOWER. In practice, however, the smaller the quorum, the fewer spurious
edges there will be that can be safely eliminated. WINNOWER's descendant,
MITRA-GRAPH, was
presented in (Eskin and Pevzner 2002, Eskin, Gelfand, and Pevzner 2003). The
two algorithms that are similar to WINNOWER and MITRA-GRAPH, namely
KMRC and POIVRE, appeared in (Sagot, Viari, Pothier, and Soldano 1995,
Soldano, Viari, and Champesme 1995).

Information concerning the Profile database can be found in (Buhler and
Tompa 2001).

Some examples of the use of relative entropy for the evaluation of network 
expressions can be found in (Vanet, Marsan, and Sagot 1999, Pavesi, Mauri, 
and Pesole 2001b). 

The algorithm SPELLER was introduced in (Sagot 1998), while the two variants
it inspired, MITRA-COUNT and WEEDER, were described in, respectively,
(Eskin and Pevzner 2002) and (Pavesi, Mauri, and Pesole 2001a). Detailed
information about the generalized suffix tree data structure can be found in
(Bieganski, Riedl, Carlis, and Retzel 1994, Hui 1992). A recent approach using
a suffix tree as in SPELLER but working with the edit distance is given in
(Adebiyi, Jiang, and Kaufmann 2001, Adebiyi and Kaufmann 2002). The approach
seems to produce a different solution from the one that would result from
applying an extension of SPELLER that works with the edit distance.

Concerning algorithms for inferring network expressions with constrained 
spacers, the various versions of the SMILE algorithm are described in (Marsan 
and Sagot 2000b, 2001), and MITRA-DYAD is presented in (Eskin and Pevzner 
2002, Eskin et al. 2003). 

For the case of unconstrained spacers, PRATT is introduced in (Brazma, 
Jonassen, Vilo, and Ukkonen 1998c, 1998b, Brazma, Jonassen, Eidhammer, and 
Gilbert 1998a, Jonassen et al. 1995). A few years after PRATT was conceived, 
an algorithm that is roughly equivalent to PRATT in terms of its output was 
elaborated which uses a lazy implementation of the suffix tree (Giegerich, Kurtz, 
and Stoye 1999) to represent the patterns as these are produced. The lazy suffix 
tree construction as adapted by the authors to their needs takes quadratic time 
but is claimed to be efficient in most practical situations (Brazma et al. 1998c). 

The notion of basis of repeated motifs was introduced in (Parida, Rigoutsos, 
Floratos, Platt, and Gao 2000, Parida, Rigoutsos, and Platt 2001). The more 
recent notion of tiling motifs was described in (Pisanti, Crochemore, Grossi, and 
Sagot 2003). 

Finally, the SATELLITE algorithm for inferring tandem network expressions 
can be found in (Sagot and Myers 1998). 

Readers interested in approaches to the inference, in biological applications, 
of simple network expressions or of network expressions with spacers using 
heuristics or statistical methods may consult (Durbin, Eddy, Krogh, and Mitchi- 
son 1998), (Pevzner 2000) and (Waterman 1995). Machine learning techniques 
have also long been in use for inferring patterns or grammars. References to 
some of these techniques as applied to biology may be found in (Baldi and
Brunak 1998). 


6.0. Introduction


Statistical and probabilistic properties of words in sequences have been of con- 
siderable interest in many fields, such as coding theory and reliability theory, 
and most recently in the analysis of biological sequences. The latter will serve 
as the key example in this chapter. We only consider finite words. 

Two main aspects of word occurrences in biological sequences are: where 
do they occur and how many times do they occur? An important problem, for 
instance, was to determine the statistical significance of a word frequency in a 
DNA sequence. The naive idea is the following: a word may be significantly rare
in a DNA sequence because it disrupts replication or gene expression (perhaps
a negative selection factor), whereas a significantly frequent word may have a
fundamental activity with regard to genome stability. Well-known examples
of words with exceptional frequencies in DNA sequences are certain biological
palindromes corresponding to restriction sites avoided, for instance, in E. coli,
and the Cross-over Hotspot Instigator sites in several bacteria. Identifying over- 
and under-represented words in a particular genome is a very common task in 
genome analysis. 

Statistical methods to study the distribution of the word locations along a 
sequence and word frequencies have also been an active field of research; the 
goal of this chapter is to provide an overview of the state of this research. 

Because DNA sequences are long, asymptotic distributions were proposed 
first. Exact distributions exist now, motivated by the analysis of genes and 
protein sequences. Unfortunately, exact results are not adapted in practice for 
long sequences because of heavy numerical calculation, but they allow the user to 
assess the quality of the stochastic approximations when no approximation error 
can be provided. For example, BLAST is probably the best-known algorithm 
for DNA matching, and it relies on a Poisson approximation. Approximate p- 
values can be given; yet the applicability of the Poisson approximation needs to 
be justified. 

Statistical properties of words only make sense with respect to some under- 
lying probability model. DNA sequences are commonly modeled as stationary 
random sequences. Typical models are homogeneous m-order Markov chains 
(model Mm) in which the probability of occurrence of a letter at a given
position depends only on the m previous letters in the sequence (and not on the
position); the independent case is a particular case with m = 0. Hidden Markov 
models (HMMs) reveal however that the composition of a DNA sequence may 
vary over the sequence. However, no statistical properties of words have been 
yet derived in such heterogeneous models. DNA sequences code for amino acid 
sequences (proteins) by non-overlapping triplets called codons. The three posi- 
tions of the codons have distinct statistical properties, so that for coding DNA 
we naturally think of three sequences where the successive letters come from the 
three codon positions, respectively. The three chains and their transition ma- 
trices are denoted as Mm-3. In this chapter, we will focus on the homogeneous 
models Mm and give existing results for Mm-3. 

Because these probabilistic models have to be fitted to the observed biolog- 
ical sequence, we will pay attention to the influence of the model parameter 
estimation on the statistical results. Some asymptotic results take care of this 
problem but the exact results require that the true model driving the observed 
sequence is known. 

The choice of the Markov model order depends on the sequence length, 
because of the data requirements in estimation. One might be able to test 
hierarchical models using Chi-square tests to assign which order of Markovian 
dependence is appropriate for the underlying sequence. From a practical point 
of view, it also depends on the composition of the biological sequence one wants 
to take into account. Indeed, if the sequence was generated from an m-order 
Markov chain, then the model Mm provides a good prediction for the
(m + 1)-letter words.

In this chapter, we are concerned firstly with the occurrences of a single 
pattern in a sequence. To begin, we discuss the underlying probabilistic models 
(Section 6.1). The main complication for word occurrences arises from over- 
laps of words. One might be interested either in overlapping occurrences or 
in particular non-overlapping ones (Section 6.2). After presenting results for 
the statistical distribution of word locations along the sequence (Section 6.3), 
we focus on the distribution of the number of overlapping occurrences (Section 
6.4) and the number of renewals (Section 6.5). In Section 6.6, we will study 
the occurrences of multiple patterns. Section 6.7 gives two applications on how 
probabilistic and statistical considerations come into play for DNA sequence 
analysis. Firstly, we look for words with unexpected counts in some DNA se- 
quences. The focus will be on the importance of the order m of the Markov 
model used and on the interest of using a model of the type Mm-3 (with three 
transition matrices), when analyzing a coding DNA sequence. We will also take 
the opportunity to compare exact and asymptotic results on the word count 
distributions. Secondly, we describe how to analyze so-called SBH chips, a fast
and effective method for determining a DNA sequence. These chips provide the
$\ell$-tuple contents of a DNA sequence, where typically $\ell = 8$, 10 or 12. A
non-trivial combinatorial problem arises when determining the probability that a
randomly chosen DNA sequence can be uniquely reconstructed from its $\ell$-tuple
contents. Finally, Section 6.8, meant as an appendix, gives a compilation of
more general techniques that are applied in this chapter.

Due to the abundance of literature, the present chapter does not aim to be
a complete literature survey (indeed even just a list of references would
take up all the space designated to this chapter), but rather to introduce the
reader to the major aspects of this field, to provide some techniques and to warn
of major pitfalls associated with the analysis of words. For the same reason we
completely omit the algorithmic aspect.


6.1. Probabilistic models for biological sequences 


In this chapter, a biological sequence is either a DNA sequence or a protein 
sequence, that is, a finite sequence of letters either in the 4-letter DNA alphabet 
{a,c,g,t} or the 20-letter amino-acid alphabet. To model a biological sequence, 
we will consider models for random sequences of letters. Even if we observe
a finite biological sequence $S = s_1 s_2 \cdots s_n$, we consider for convenience in the
whole chapter an infinite random sequence $X = (X_i)_{i \in \mathbb{Z}}$ on a finite alphabet
$\mathcal{A}$, where $\mathbb{Z}$ is the set of integers. We present below two classes of Markov
models widely used to analyze biological sequences and how to estimate their 
parameters according to the observed sequence. Then we give a classical Chi- 
square test to choose the appropriate order of the Markov model for a given 
sequence. 

However, we will see in Section 6.7.1 that the choice of the model also has 
to take biological considerations on the sequence composition into account. 


6.1.1. Markovian models for random sequences of letters 


The simplest model assumes that the letters $X_i$ are independent and take on the
value $a \in \mathcal{A}$ with probability $\mu(a) = 1/\mathrm{Card}(\mathcal{A})$, where $\mathrm{Card}(\mathcal{A})$ denotes the
size of the alphabet. To refine this model, we can simply assume independent
letters taking values in $\mathcal{A}$ with probabilities $(\mu(a))_{a \in \mathcal{A}}$ such that
$\sum_{a \in \mathcal{A}} \mu(a) = 1$.
This is called model M0. Typically for DNA sequences, this model is not very
accurate. Therefore, we consider a much more general homogeneous model, the
model Mm: an ergodic stationary m-order Markov chain on a finite alphabet $\mathcal{A}$
with transition matrix $\Pi = (\pi(a_1 \cdots a_m, a_{m+1}))_{a_1,\ldots,a_{m+1} \in \mathcal{A}}$ such that

$$\pi(a_1 \cdots a_m, a_{m+1}) = P(X_i = a_{m+1} \mid X_{i-1} = a_m, \ldots, X_{i-m} = a_1).$$


In general, a stationary distribution of an ergodic stationary Markov chain
with transition matrix $\Pi$ is defined as a solution of $\mu = \mu \Pi$. This implies that
the above Markov chain has a unique stationary distribution $\mu$ on $\mathcal{A}^m$ defined
by

$$\mu(a_1 \cdots a_m) = P(X_i \cdots X_{i+m-1} = a_1 \cdots a_m), \quad \forall i \in \mathbb{Z},$$

such that the equation

$$\mu(a_1 \cdots a_m) = \sum_{b \in \mathcal{A}} \mu(b a_1 \cdots a_{m-1})\, \pi(b a_1 \cdots a_{m-1}, a_m)$$

is satisfied for all $(a_1 \cdots a_m) \in \mathcal{A}^m$. The model where the letters $\{X_i\}_{i \in \mathbb{Z}}$ are
chosen independently with probabilities $p_1, p_2, \ldots, p_{|\mathcal{A}|}$ corresponds to the
transition matrix $\Pi$ with identical rows $(p_1\ p_2\ \cdots\ p_{|\mathcal{A}|})$ and stationary distribution
$\mu = (p_1, p_2, \ldots, p_{|\mathcal{A}|})$.

A coding DNA sequence is naturally read as successive non-overlapping 3- 
letter words called codons. These codons are then translated into amino acids 
via the genetic code to produce a protein sequence. Several different codons can 
code for the same amino acid, and often the first two letters of a codon suffice to 
determine the corresponding amino acid. Therefore, letters may have different 
importance depending on their position with respect to the codon partition. 
To distinguish the letter probabilities according to their position modulo 3 in 
the coding DNA sequence, we consider a stationary Markov chain with three
distinct transition matrices $\Pi_1$, $\Pi_2$ and $\Pi_3$ such that, for $a_1, \ldots, a_{m+1} \in \mathcal{A}$ and
$k \in \{1,2,3\}$,

$$\pi_k(a_1 \cdots a_m, a_{m+1}) = P(X_{3j+k} = a_{m+1} \mid X_{3j+k-1} = a_m, \ldots, X_{3j+k-m} = a_1).$$


This is model Mm-3. The index k € {1,2,3} is called phase and represents the 
position of a letter inside a codon. By convention, the phase of a word is the 
phase of its last letter in the sequence; codons are then 3-letter words in phase 
3. 

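The stationary distribution $\mu$ on $\mathcal{A}^m \times \{1,2,3\}$ is given by

$$\mu(a_1 \cdots a_m, k) = P(X_{3j+k-m+1} \cdots X_{3j+k} = a_1 \cdots a_m), \quad \forall j \in \mathbb{Z},$$

such that the equation

$$\mu(a_1 \cdots a_m, k) = \sum_{b \in \mathcal{A}} \mu(b a_1 \cdots a_{m-1}, k-1)\, \pi_k(b a_1 \cdots a_{m-1}, a_m)$$

is satisfied for all $(a_1 \cdots a_m, k) \in \mathcal{A}^m \times \{1,2,3\}$.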


Some general results for Markov chains will be used in the exposition. For
simplicity we concentrate here on the case of a 1-order Markov chain.


The stationary distribution of a Markov chain can be obtained from its
transition matrix. For a 1-order Markov chain we diagonalize the transition
matrix as follows. Let $(\alpha_t)_{t=1,\ldots,|\mathcal{A}|}$ be the eigenvalues of $\Pi$ such that
$|\alpha_1| \ge |\alpha_2| \ge \cdots \ge |\alpha_{|\mathcal{A}|}|$. The Perron-Frobenius Theorem ensures that
$\alpha_1 = 1$ and $|\alpha_2| < 1$; we abbreviate

$$\alpha := \alpha_2. \tag{6.1.1}$$

Then $(1,1,\ldots,1)^T$ is a right eigenvector of $\Pi$ for the eigenvalue 1 whereas the
vector of stationary distribution $(\mu(a), a \in \mathcal{A})$ is a left eigenvector of $\Pi$ for the
eigenvalue 1. Let $D = \mathrm{Diag}(1, \alpha_2, \alpha_3, \ldots, \alpha_{|\mathcal{A}|})$. We decompose $\Pi = P D P^{-1}$
such that the first column of $P$ is $(1,1,\ldots,1)^T$; then the first row of $P^{-1}$ is
the vector of stationary distribution $(\mu(a), a \in \mathcal{A})$. For all $t \in \{1,\ldots,|\mathcal{A}|\}$,
$I_t$ denotes the $|\mathcal{A}| \times |\mathcal{A}|$ matrix such that all its entries are equal to 0 except
$I_t(t,t) = 1$, and we define

$$Q_t := P I_t P^{-1}. \tag{6.1.2}$$

We shall use the following decomposition of the h-step transition matrix $\Pi^h$,

$$\Pi^h = P D^h P^{-1} = \sum_{t=1}^{|\mathcal{A}|} \alpha_t^h Q_t, \tag{6.1.3}$$

and the fact that

$$Q_1(a,b) = \mu(b), \quad \forall a,b \in \mathcal{A}. \tag{6.1.4}$$

In the exposition, we shall also refer to the reversed Markov chain, for a
1-order chain. Its h-step transition probabilities are given by

$$\pi_R^{(h)}(a,b) = \frac{\mu(b)\, \pi^{(h)}(b,a)}{\mu(a)},$$

where the $\pi^{(h)}(a,b)$'s are the h-step transition probabilities for the chain itself.
Another useful quantity is 


= 1-mi i i b)?. wl. 
p in | p(t), Yo gm } (6.1.5) 


be A 


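These quantities can easily be generalized to m-order Markov chains, using
the following embedding. Let us now assume that the sequence $(X_i)_{i \in \mathbb{Z}}$ is an
m-order Markov chain on the alphabet $\mathcal{A}$, with transition probabilities
$\pi(a_1 \cdots a_m, a_{m+1})$, $a_1, \ldots, a_{m+1} \in \mathcal{A}$. Rewrite the sequence over the alphabet
$\mathcal{A}^m$ by defining

$$\mathbb{X}_i := X_i X_{i+1} \cdots X_{i+m-1}, \tag{6.1.6}$$

so that the sequence $(\mathbb{X}_i)_{i \in \mathbb{Z}}$ is a first-order Markov chain on $\mathcal{A}^m$ with transition
probabilities, for $A = a_1 \cdots a_m \in \mathcal{A}^m$ and $B = b_1 \cdots b_m \in \mathcal{A}^m$,

$$\Pi(A,B) = \begin{cases} \pi(a_1 \cdots a_m, b_m) & \text{if } a_2 \cdots a_m = b_1 \cdots b_{m-1}, \\ 0 & \text{otherwise.} \end{cases}$$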
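
As an illustration of this embedding, the sketch below builds the first-order transition probabilities on m-letter words from the m-order transition probabilities. It is only a sketch under stated assumptions: the function name `embed_first_order`, the dictionary layout `pi[(context, letter)]` and the numerical values of the 2-order chain are hypothetical and not part of the chapter.

```python
from itertools import product


def embed_first_order(pi, alphabet, m):
    """Return the states of A^m and Pi[(A, B)] for the embedded chain.

    pi maps (m-letter context, next letter) to a transition probability.
    """
    states = ["".join(t) for t in product(alphabet, repeat=m)]
    Pi = {}
    for A in states:
        for B in states:
            # B must extend A by one letter: a_2...a_m = b_1...b_{m-1}.
            if A[1:] == B[:-1]:
                Pi[(A, B)] = pi[(A, B[-1])]
            else:
                Pi[(A, B)] = 0.0
    return states, Pi


# Hypothetical 2-order chain on {a, b}: the last letter is repeated with
# probability 0.7 after "aa" or "bb", and with probability 0.4 otherwise.
alphabet = "ab"
pi = {}
for ctx in ("aa", "ab", "ba", "bb"):
    stay = 0.7 if ctx[0] == ctx[1] else 0.4
    for x in alphabet:
        pi[(ctx, x)] = stay if x == ctx[1] else 1.0 - stay

states, Pi = embed_first_order(pi, alphabet, 2)
```

Each row of the embedded matrix still sums to one, and a transition such as "ab" to "ab" has probability zero because the words do not overlap correctly.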


6.1.2. Estimation of the model parameters 


Modeling a biological sequence consists of choosing a probabilistic model (see
previous paragraph) and then estimating the model parameters according to
the unique realization that is the biological sequence. In the case of model
Mm, it means to estimate the transition probabilities $\pi(a_1 \cdots a_m, a_{m+1})$; their
estimators are classically denoted by $\hat\pi(a_1 \cdots a_m, a_{m+1})$.
We now derive the estimators that maximize the likelihood of the M1 model
given the observed sequence; we will then give the maximum-likelihood
estimators in models Mm and Mm-3.

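Assume $X_1 \cdots X_n$ is a stationary Markov chain on $\mathcal{A}$ with transition matrix
$\Pi = (\pi(a,b))_{a,b \in \mathcal{A}}$ and stationary distribution $(\mu(a))_{a \in \mathcal{A}}$. The likelihood L of
the model is

$$L(\pi(a,b), a,b \in \mathcal{A}) = \mu(X_1) \prod_{a,b \in \mathcal{A}} \pi(a,b)^{N(ab)}$$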


where N(ab) denotes the number of occurrences of the 2-letter word ab in the
random sequence $X_1 \cdots X_n$. To find the transition probabilities that maximize
the likelihood, one maximizes the log likelihood

$$\log L(\pi(a,b), a,b \in \mathcal{A}) = \log \mu(X_1) + \sum_{a,b \in \mathcal{A}} N(ab) \log \pi(a,b).$$


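One can separately maximize $\sum_{b \in \mathcal{A}} N(ab) \log \pi(a,b)$ for $a \in \mathcal{A}$, keeping in mind
that $\sum_{b \in \mathcal{A}} \pi(a,b) = 1$. Let $a \in \mathcal{A}$ and choose $c \in \mathcal{A}$; we have

$$\sum_{b \in \mathcal{A}} N(ab) \log \pi(a,b) = \sum_{b \neq c} N(ab) \log \pi(a,b) + N(ac) \log\Big(1 - \sum_{b \neq c} \pi(a,b)\Big)$$

and for $b \neq c$

$$\frac{\partial}{\partial \pi(a,b)} \Big( \sum_{b \in \mathcal{A}} N(ab) \log \pi(a,b) \Big) = \frac{N(ab)}{\pi(a,b)} - \frac{N(ac)}{\pi(a,c)}.$$

All the partial derivatives equal to zero means that

$$\frac{N(ab)}{\pi(a,b)} = \frac{N(ac)}{\pi(a,c)}, \quad \forall b \in \mathcal{A};$$

this implies in particular that

$$\frac{N(ab)}{\pi(a,b)} = \frac{\sum_{d \in \mathcal{A}} N(ad)}{\sum_{d \in \mathcal{A}} \pi(a,d)} = \sum_{d \in \mathcal{A}} N(ad) =: N(a\bullet), \quad \forall b \in \mathcal{A}.$$

It follows that

$$\hat\pi(a,b) = \frac{N(ab)}{N(a\bullet)}, \quad \forall b \in \mathcal{A}.$$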


Note that the second partial derivatives of the likelihood function are negative, 
assuring that we have indeed determined a maximum. 


REMARK 6.1.1. For notational convenience, the estimators mainly used in the
remainder of the chapter will be $\hat\pi(a,b) = N(ab)/N(a)$ since $N(a\bullet) = N(a)$
except for the last letter of the sequence for which the counts differ by 1.
It is important to note that the estimators $\hat\pi(a,b)$ are random variables.
Assuming that the biological sequence is a realization of the random sequence,
one can calculate a numerical value for the estimator of $\pi(a,b)$; that is

$$\hat\pi^{\mathrm{obs}}(a,b) = \frac{N^{\mathrm{obs}}(ab)}{N^{\mathrm{obs}}(a\bullet)},$$

where $N^{\mathrm{obs}}(\cdot)$ denotes the observed count in the biological sequence. As we will
see, some results are obtained assuming that the true parameters $\pi(a,b)$ are
known and equal, in practice, to $N^{\mathrm{obs}}(ab)/N^{\mathrm{obs}}(a\bullet)$, and do not take the
estimation into account. It is indeed a common practice to substitute the estimator for
the corresponding parameter in distributional results, but sometimes it changes
the distribution being studied, as we will see later.

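In the model Mm, the maximum-likelihood estimator of $\pi(a_1 \cdots a_m, a_{m+1})$,
$a_1, \ldots, a_{m+1} \in \mathcal{A}$, is

$$\hat\pi(a_1 \cdots a_m, a_{m+1}) = \frac{N(a_1 \cdots a_m a_{m+1})}{N(a_1 \cdots a_m \bullet)},$$

and in model Mm-3, we have, $\forall a_1, \ldots, a_{m+1} \in \mathcal{A}$, $\forall k \in \{1,2,3\}$,

$$\hat\pi_k(a_1 \cdots a_m, a_{m+1}) = \frac{N(a_1 \cdots a_m a_{m+1}; k)}{\sum_{b \in \mathcal{A}} N(a_1 \cdots a_m b; k)}.$$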


6.1.3. Test for the appropriate order of the Markov model 


To test which Markov model would be appropriate for a given sequence of length 
n, the most straightforward test is a Chi-square test, which can be viewed as 
a generalized likelihood ratio test. Most well-known is the Chi-square test for 
independence. 

Suppose we have a sample of size n cross-classified in a table with U rows 
and V columns. For instance, we could have four rows labeled a,c,g,t, and 
four columns labeled a,c,g,t, and we count how often a letter from the row 
is followed by a letter from the column in the sequence. First we test whether 
we may assume the sequence to consist of independent letters. To this purpose, 
recall that N(ab) denotes the count in cell (a,b), whereas $N(a\bullet)$ is the ath row
count, and let $N(\bullet b)$ be the bth column count. Thus N(ab) counts how often
letter a is followed by letter b in the sequence. Let $\pi(a,b)$ be the probability of
cell (a,b), let $\pi(a,\bullet)$ be the ath row marginal probability, and let $\pi(\bullet,b)$ be the
bth column marginal probability. We test the null hypothesis of independence

$$H_0 : \pi(a,b) = \pi(\bullet,b)$$

against the alternative that the $\pi(a,b)$'s are free. Under $H_0$, the
maximum-likelihood estimate of $\pi(a,b)$ is

$$\hat\pi(a,b) = \hat\pi(\bullet,b) = \frac{N(\bullet b)}{n-1}.$$

The Pearson chi-square statistic is the sum of the squared differences between
observed and estimated expected counts, divided by the estimated expected
count, where expectations are taken assuming that the null hypothesis is true.
Thus, under $H_0$, for the count N(ab) we expect $(n-1)\mu(a)\pi(\bullet,b)$, estimated
by $N(a\bullet)\hat\pi(\bullet,b)$, and the chi-square statistic is

$$\chi^2 = \sum_{a,b \in \mathcal{A}} \frac{\big(N(ab) - N(a\bullet)N(\bullet b)/(n-1)\big)^2}{N(a\bullet)N(\bullet b)/(n-1)}.$$

Under the null hypothesis, $\chi^2$ follows asymptotically a chi-square distribution
with $(\mathrm{Card}(\mathcal{A}) - 1)^2$ degrees of freedom. Thus we would reject the null hypothesis
when $\chi^2$ is too large, compared to the corresponding chi-square distribution.
As a rule of thumb, this test is applicable when the expected count in each row
and column is at least 5. Applying this test to DNA counts, we thus would have
to compare $\chi^2$ to a chi-square distribution with $(4-1)^2 = 9$ degrees of freedom.
A typical cutoff level would be 5%, or, if one would like to be conservative, 1%.
The corresponding critical values are 16.92 for 5%, and 21.67 for 1%. Thus,
if $\chi^2 > 16.92$, we would reject the null hypothesis of independence at the 5%
level (meaning that, if we repeated this experiment many times, in about 5%
of the cases we would reject the null hypothesis when it is true). If $\chi^2 > 21.67$,
we could reject the null hypothesis at the 1% level (so in only about 1% of all
trials would we reject the null hypothesis when it is true). Otherwise we would
not reject the null hypothesis.
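
The statistic described above can be computed directly from the dinucleotide counts of a sequence. The sketch below is one possible way to do so under stated assumptions: the function name `chi_square_independence` and the artificial periodic test sequence are illustrative, and the critical value 16.92 is the 5% value quoted in the text for 9 degrees of freedom.

```python
from collections import Counter


def chi_square_independence(seq, alphabet="acgt"):
    """Pearson chi-square statistic for independence of successive letters."""
    n = len(seq)
    pairs = Counter(zip(seq, seq[1:]))   # N(ab)
    rows = Counter(seq[:-1])             # N(a.): a followed by any letter
    cols = Counter(seq[1:])              # N(.b): b preceded by any letter
    stat = 0.0
    for a in alphabet:
        for b in alphabet:
            expected = rows[a] * cols[b] / (n - 1)
            if expected > 0:             # skip cells with no expected count
                stat += (pairs[(a, b)] - expected) ** 2 / expected
    return stat


# A strictly periodic (hence strongly dependent) artificial sequence.
stat = chi_square_independence("acgt" * 50)
reject_at_5_percent = stat > 16.92
```

For real data one would also check the rule of thumb on expected counts before interpreting the result.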

If the null hypothesis of independence cannot be rejected at an appropriate 
level (say, 5 %), then one would fit an independent model. However, if the null 
hypothesis is rejected, one would test for a higher-order dependence. The next 
step would thus be to test for a first-order Markov chain. We describe here the 
general case. 

Suppose we know that our data come from a Markov chain of order at
most m. Let $N(a_1 a_2 \ldots a_{m+1})$ be the count of the vector $(a_1, a_2, \ldots, a_{m+1})$
in the sequence $(X_1, \ldots, X_n)$, let $N(a_1 a_2 \ldots a_m \bullet)$ be the count of the vector
$(a_1, a_2, \ldots, a_m)$ in the sequence $(X_1, \ldots, X_{n-1})$, let $N(\bullet a_{m-r+1} \ldots a_m \bullet)$ be the
count of the vector $(a_{m-r+1}, \ldots, a_m)$ in the sequence $(X_{r+1}, \ldots, X_{n-1})$, $r < m$,
and let $N(\bullet a_{m-r+1} \ldots a_{m+1})$ be the count of the vector $(a_{m-r+1}, \ldots, a_{m+1})$ in
the sequence $(X_{r+1}, \ldots, X_n)$. Put

$$\hat\pi(a_1 \ldots a_m, a_{m+1}) = \frac{N(\bullet a_{m-r+1} \ldots a_{m+1})}{N(\bullet a_{m-r+1} \ldots a_m \bullet)}.$$

Then under the null hypothesis of having a Markov chain of order r against the
alternative that it is a Markov chain of order higher than r, the test statistic

$$\chi^2 = \sum_{a_1, \ldots, a_{m+1} \in \mathcal{A}} \frac{\big(N(a_1 a_2 \ldots a_{m+1}) - N(a_1 a_2 \ldots a_m \bullet)\, \hat\pi(a_1 \ldots a_m, a_{m+1})\big)^2}{N(a_1 a_2 \ldots a_m \bullet)\, \hat\pi(a_1 \ldots a_m, a_{m+1})}$$

is asymptotically chi-square distributed; the degrees of freedom are given by
$(\mathrm{Card}(\mathcal{A})^{m+1} - \mathrm{Card}(\mathcal{A})^{m}) - (\mathrm{Card}(\mathcal{A})^{r+1} - \mathrm{Card}(\mathcal{A})^{r})$.
Although this test can be carried out for arbitrary orders, caution is advised:
for higher order, a longer sequence of observations is required.
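
The degrees-of-freedom formula for the general order test is easy to evaluate, and doing so shows how quickly the number of cells grows with the order. The helper below is a small sketch; the name `chi_square_df` is an assumption introduced here for illustration.

```python
def chi_square_df(card, m, r):
    """Degrees of freedom for testing order r within order at most m.

    card is the alphabet size Card(A); requires 0 <= r < m.
    """
    if not 0 <= r < m:
        raise ValueError("expected 0 <= r < m")
    return (card ** (m + 1) - card ** m) - (card ** (r + 1) - card ** r)


df_independence = chi_square_df(4, 1, 0)   # independence within order 1, DNA
df_order_1_vs_2 = chi_square_df(4, 2, 1)   # order 1 within order at most 2
```

With card = 4, m = 1 and r = 0 the formula gives 9, in agreement with the (4 - 1)^2 degrees of freedom of the independence test on DNA counts.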


6.2. Overlapping and non-overlapping occurrences 


Statistical inference is often based on independence assumptions. Even if the se- 
quence letters are independent and identically distributed, the different random 
indicators of word occurrences are not independent due to overlaps. For
example, if w = atat occurs at position i in the sequence, then another occurrence
of w is much more likely to occur at position i + 2 than if w did not occur at
position i, and an occurrence of w at position i + 1 is not possible. Many of the
arguments needed for a probabilistic and statistical analysis of word occurrences 
deal with disentangling this overlapping structure. 

Let $w = w_1 \cdots w_\ell$ be a word of length $\ell$ on a finite alphabet $\mathcal{A}$. Two
occurrences of w may overlap in a sequence if and only if w is periodic, meaning
that there exists a period $p \in \{1, \ldots, \ell-1\}$ such that $w_i = w_{i+p}$, $i = 1, \ldots, \ell-p$.
A word may have several periods: for instance gtgtgtg admits three periods,
2, 4, 6, and aacaa has the periods 3 and 4. The set P(w) of the periods of w is
defined by

$$P(w) := \{p \in \{1, \ldots, \ell-1\} : w_i = w_{i+p},\ \forall i = 1, \ldots, \ell-p\}.$$


A word w is not periodic if and only if P(w) is empty. As we will see later, not all 
periods of a word will have the same importance; we distinguish the multiples of 
the minimal period po(w) of w from the so-called principal periods of w, namely 
the periods that are not strictly multiples of the minimal period. We denote by 
P'(w) the set of the principal periods of w. For instance, P’(gtgtgtg) = {2} 
and P’(aacaa) = {3,4}. 
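
The sets of periods and principal periods can be computed by a direct check of the definition. The following sketch is illustrative only; the function names `periods` and `principal_periods` are assumptions, and no attempt is made at efficiency.

```python
def periods(w):
    """All p in {1, ..., len(w)-1} with w[i] == w[i+p] wherever both exist."""
    n = len(w)
    return [p for p in range(1, n)
            if all(w[i] == w[i + p] for i in range(n - p))]


def principal_periods(w):
    """Periods that are not strict multiples of the minimal period."""
    ps = periods(w)
    if not ps:
        return []
    p0 = ps[0]                      # minimal period
    return [p for p in ps if p == p0 or p % p0 != 0]


example_1 = (periods("gtgtgtg"), principal_periods("gtgtgtg"))
example_2 = (periods("aacaa"), principal_periods("aacaa"))
```

On the two words used in the text this reproduces the periods 2, 4, 6 with principal period 2 for gtgtgtg, and the periods 3 and 4, both principal, for aacaa.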

Occurrences of periodic words tend to overlap in a sequence. There are 
4 occurrences of aacaa in the sequence tgaacaaacaacaatagaacaaaa, starting 
respectively at positions 3, 7, 10 and 18. The first 3 occurrences overlap and 
form a clump. A clump of w in a sequence is a maximal set of overlapping 
occurrences of w in the sequence. By definition two clumps of w in a sequence 
cannot overlap. A clump composed of exactly k overlapping occurrences of w is
called a k-clump of w. There are 2 clumps of aacaa in the previous sequence,
the first one is a 3-clump starting at position 3 and the second one is a 1-clump
starting at position 18. Let $C_k(w)$ be the set of the concatenated words composed
of exactly k overlapping occurrences of w. For example, $C_1(\text{aacaa}) = \{\text{aacaa}\}$
and $C_2(\text{aacaa}) = \{\text{aacaacaa}, \text{aacaaacaa}\}$.
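
The occurrence and clump counts of the example can be reproduced with a few lines of code. This is a minimal sketch using 1-based positions as in the text; the function names `occurrences` and `clumps` are illustrative assumptions.

```python
def occurrences(w, seq):
    """1-based start positions of all (possibly overlapping) occurrences."""
    length = len(w)
    return [i + 1 for i in range(len(seq) - length + 1)
            if seq[i:i + length] == w]


def clumps(w, seq):
    """List of (start, k): start position and number of occurrences per clump."""
    result = []
    last = None
    for pos in occurrences(w, seq):
        # Two successive occurrences overlap when they share a position.
        if last is not None and pos <= last + len(w) - 1:
            start, k = result[-1]
            result[-1] = (start, k + 1)
        else:
            result.append((pos, 1))
        last = pos
    return result


sequence = "tgaacaaacaacaatagaacaaaa"
occ = occurrences("aacaa", sequence)
clumped = clumps("aacaa", sequence)
```

Here `occ` lists the four start positions 3, 7, 10 and 18, and `clumped` contains a 3-clump starting at position 3 and a 1-clump starting at position 18.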

For a word $w = w_1 \cdots w_\ell$ we use the following prefix and suffix notation:

$$w^{(p)} = w_1 \cdots w_p \ \text{denotes the prefix of } w \text{ of length } p,$$
$$w_{(q)} = w_{\ell-q+1} \cdots w_\ell \ \text{denotes the suffix of } w \text{ of length } q, \tag{6.2.1}$$

and $w^{(p)}w = w_1 \cdots w_p w_1 \cdots w_\ell$ is the concatenated word obtained by two
overlapping occurrences starting p positions apart. If $p \in P(w)$ then $w^{(p)}$ is called
a root of w; if $p \in P'(w)$, $w^{(p)}$ is called a principal root of w.
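
Since two occurrences of a word can start p positions apart exactly when p is a period, the words made of exactly two overlapping occurrences are the concatenations of a root with the word itself. The sketch below enumerates them; the function names `periods` and `two_occurrence_words` are illustrative assumptions.

```python
def periods(w):
    """All p in {1, ..., len(w)-1} with w[i] == w[i+p] wherever both exist."""
    n = len(w)
    return [p for p in range(1, n)
            if all(w[i] == w[i + p] for i in range(n - p))]


def two_occurrence_words(w):
    """Concatenations (prefix of length p) + w, one for each period p of w."""
    return {w[:p] + w for p in periods(w)}


c2 = two_occurrence_words("aacaa")
```

For aacaa the roots are aac (period 3) and aaca (period 4), which gives the two words aacaacaa and aacaaacaa.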


Version June 23, 2004 




Related to the set of periods is the autocorrelation polynomial Q(z) associ- 
ated with w defined by 


Qz)=1+ his (6.2.2) 


Renewals are another type of non-overlapping occurrences of interest that 
require scanning the sequence from one end to the other: the first occurrence of 
w in the sequence is a renewal and a given occurrence of w is a renewal if and 
only if it does not overlap a previous renewal. Renewals of w do not overlap 
in a sequence. In the above example, there are 3 renewals of aacaa starting at 
position 3, 10 and 18. 
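
The left-to-right scan that defines renewals translates directly into code: an occurrence is kept only if it starts after the end of the last renewal. The following is a small sketch with 1-based positions; the function name `renewals` is an illustrative assumption.

```python
def renewals(w, seq):
    """1-based start positions of the renewals of w in seq."""
    length = len(w)
    result = []
    next_free = 0   # 0-based index from which the next renewal may start
    for i in range(len(seq) - length + 1):
        if i >= next_free and seq[i:i + length] == w:
            result.append(i + 1)
            next_free = i + length   # later occurrences must not overlap it
    return result


ren = renewals("aacaa", "tgaacaaacaacaatagaacaaaa")
```

On the example sequence the occurrence at position 7 overlaps the renewal at position 3 and is skipped, leaving renewals at positions 3, 10 and 18.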

Depending on the problem, one could be interested in studying the over- 
lapping occurrences of w in a sequence, or in restricting attention to non- 
overlapping occurrences: the beginnings of clumps, the beginnings of k-clumps 
or the renewals. We now introduce notation related to occurrences of a word
$w = w_1 \cdots w_\ell$, of a clump of w, of a k-clump of w, of a renewal of w in a
sequence, and to the corresponding counts.


Occurrence and number of overlapping occurrences An occurrence of $w$
starts at position $i$ in the sequence $X = (X_i)_{i \in \mathbb{Z}}$ if and only if
$X_i \cdots X_{i+\ell-1} = w_1 \cdots w_\ell$. Let $Y_i(w)$ be the associated random indicator

$$Y_i(w) := \mathbb{I}\{w \text{ starts at position } i \text{ in } X\}. \tag{6.2.3}$$


For convenience in some sections, $Y_i(w)$ will be the random indicator that an
occurrence of $w$ ends at position $i$ in $X$; it will be made precise in that case.

In the stationary $m$-order Markovian model, the expectation of $Y_i(w)$, that
is, the probability that an occurrence of $w$ occurs at a given position in the
sequence, is denoted by $\mu_m(w)$ and is given by

$$\mu_m(w) = \mu(w_1 \cdots w_m)\, \pi(w_1 \cdots w_m, w_{m+1}) \cdots \pi(w_{\ell-m} \cdots w_{\ell-1}, w_\ell). \tag{6.2.4}$$


When there is no ambiguity, the index m referring to the order of the model 
will be omitted. 

The number of overlapping occurrences of $w$ in the sequence $(X_i)_{i=1,\dots,n}$,
simply called count of $w$ in this chapter, is defined by
$N(w) = N_n(w) = \sum_{i=1}^{n-\ell+1} Y_i(w)$ (or $N(w) = \sum_{i=\ell}^{n} Y_i(w)$ if $Y_i(w)$ is
associated with an occurrence of $w$ ending at position $i$).

Clump and declumped counts A clump of $w$ starts at position $i$ in the
infinite sequence $X$ if and only if there is an occurrence of $w$ starting at position
$i$ that does not overlap a previous occurrence of $w$. It follows that

$$
\begin{aligned}
\tilde Y_i(w) &:= \mathbb{I}\{\text{a clump of } w \text{ starts at position } i \text{ in } X\}\\
&= Y_i(w)\,(1 - Y_{i-1}(w)) \cdots (1 - Y_{i-\ell+1}(w)).
\end{aligned}
\tag{6.2.5}
$$

Often the product $Y_i(w)Y_{i-j}(w)$ is zero, depending on the overlapping structure of $w$
(two occurrences $j \le \ell-1$ positions apart force $j \in P(w)$). Using the
principal periods, it turns out that

$$\tilde Y_i(w) = Y_i(w) - \sum_{p \in P'(w)} Y_{i-p}(w^{(p)}w) \tag{6.2.6}$$


with the notation from (6.2.1). Equation (6.2.6) is obtained from the two
following steps: (i) note that an occurrence of $w$ starting at position $i$ overlaps a
previous occurrence of $w$ if and only if it is directly preceded by an occurrence
of a principal root of $w$, meaning that a principal root $w^{(p)}$, $p \in P'(w)$, occurs
at position $i - p$; (ii) note that the events $E_p = \{Y_{i-p}(w^{(p)}w) = 1\}$, $p \in P'(w)$,
are disjoint. To prove (ii), we assume that two different principal roots $w^{(p)}$ and
$w^{(q)}$ occur simultaneously at positions $i - p$ and $i - q$. If so, the minimal root
$w^{(p_0)}$ of $w$ could be decomposed into $w^{(p_0)} = xy = yx$ where $x$ and $y$ are two
nonempty words. Now, two words commute if and only if they are powers of
the same word. Thus, we would obtain the contradiction that the minimal root
is not minimal.

It follows from Equation (6.2.6) that the probability $\tilde\mu(w)$ that a clump of
$w$ starts at a given position in $X$ is given by

$$
\begin{aligned}
\tilde\mu(w) &= \mu(w) - \sum_{p \in P'(w)} \mu(w^{(p)}w)\\
&= (1 - A(w))\, \mu(w)
\end{aligned}
\tag{6.2.7}
$$

where $A(w)$ is the probability for an occurrence of $w$ to be overlapped from the
left by a previous occurrence of $w$:

$$A(w) = \sum_{p \in P'(w)} \frac{\mu(w^{(p)}w)}{\mu(w)}. \tag{6.2.8}$$
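As a numerical illustration of $\mu(w)$, of the overlap probability $A(w) = \sum_{p \in P'(w)} \mu(w^{(p)}w)/\mu(w)$ and of the clump probability $\tilde\mu(w) = (1 - A(w))\mu(w)$, here is a minimal sketch for a first-order chain. The two-letter chain, its transition values and all function names are assumptions made for the example; `principal_periods` again assumes that $P'(w)$ excludes strict multiples of the minimal period.

```python
from fractions import Fraction as F


def periods(w):
    n = len(w)
    return [p for p in range(1, n) if all(w[i] == w[i + p] for i in range(n - p))]


def principal_periods(w):
    ps = periods(w)
    return [p for p in ps if p == ps[0] or p % ps[0] != 0]


def word_prob(w, mu, pi):
    """mu(w) = mu(w_1) pi(w_1, w_2) ... pi(w_{l-1}, w_l) in a stationary first-order chain."""
    prob = mu[w[0]]
    for a, b in zip(w, w[1:]):
        prob *= pi[a][b]
    return prob


def overlap_probability(w, mu, pi):
    """A(w): sum over principal periods p of mu(w^(p) w) / mu(w)."""
    total = sum(word_prob(w[:p] + w, mu, pi) for p in principal_periods(w))
    return total / word_prob(w, mu, pi)


def clump_probability(w, mu, pi):
    """Probability that a clump of w starts at a given position: (1 - A(w)) mu(w)."""
    return (1 - overlap_probability(w, mu, pi)) * word_prob(w, mu, pi)


# Hypothetical chain on {a, c}; mu is its stationary distribution.
pi = {"a": {"a": F(3, 4), "c": F(1, 4)}, "c": {"a": F(1, 2), "c": F(1, 2)}}
mu = {"a": F(2, 3), "c": F(1, 3)}
print(word_prob("aa", mu, pi), overlap_probability("aa", mu, pi), clump_probability("aa", mu, pi))
```

For $w = aa$ the result can be checked by hand: a clump starts at $i$ exactly when $X_{i-1} = c$, $X_i = X_{i+1} = a$, which has probability $\mu(c)\pi(c,a)\pi(a,a) = 1/8$.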


The number $\tilde N(w)$ of clumps of $w$ in the finite sequence $X_1 \cdots X_n$ (or the
declumped count) may be different from the sum $\tilde N_{\inf}(w) = \sum_{i=1}^{n-\ell+1} \tilde Y_i(w)$
because of a possible clump of $w$ that would start in $X$ before position 1 and
would stop after position $\ell - 1$. The difference $\tilde N(w) - \tilde N_{\inf}(w)$ is either equal
to 0 or equal to 1. In fact, it can be shown that
$P(\tilde N(w) \neq \tilde N_{\inf}(w)) \le (\ell - 1)(\mu(w) - \tilde\mu(w))$.


k-clump and number of k-clumps A $k$-clump of $w$ starts at position $i$ in $X$
if and only if there is an occurrence of a concatenated word $c \in C_k(w)$ starting
at position $i$ that does not overlap any other occurrence of $w$ in the sequence
$X$. As we proceeded for a clump occurrence, an occurrence of $c \in C_k(w)$ is a
$k$-clump of $w$ in $X$ if and only if it is not directly preceded by any principal root
$w^{(p)}$ of $w$ and it is not directly followed by any suffix $w_{(q)} = w_{\ell-q+1} \cdots w_\ell$ with
$q \in P'(w)$. Some straightforward calculation yields the expression

$$
\begin{aligned}
\tilde Y_{i,k}(w) &:= \mathbb{I}\{\text{a } k\text{-clump of } w \text{ starts at position } i \text{ in } X\}\\
&= \sum_{c \in C_k(w)} \Big[ Y_i(c) - \sum_{p \in P'(w)} Y_{i-p}(w^{(p)}c) - \sum_{q \in P'(w)} Y_i(c\,w_{(q)})
+ \sum_{p,q \in P'(w)} Y_{i-p}(w^{(p)}c\,w_{(q)}) \Big],
\end{aligned}
\tag{6.2.9}
$$

with the notation (6.2.1). It follows that the probability for a $k$-clump to start
at a given position is given by

$$\tilde\mu_k(w) = \sum_{c \in C_k(w)} \mu(c) - 2 \sum_{c' \in C_{k+1}(w)} \mu(c') + \sum_{c'' \in C_{k+2}(w)} \mu(c'').$$

This formula can be simplified. Note that $C_{k+1}(w) = \{w^{(p)}c,\ c \in C_k(w),\ p \in P'(w)\}$ and
$\mu(w^{(p)}c) = \mu(c)\,\mu(w^{(p)}w)/\mu(w)$. By using the overlap probability $A(w)$ given in (6.2.8),
we have that

$$\sum_{c' \in C_{k+1}(w)} \mu(c') = A(w) \sum_{c \in C_k(w)} \mu(c)$$

and it follows that

$$
\begin{aligned}
\tilde\mu_k(w) &= (1 - A(w))^2 \sum_{c \in C_k(w)} \mu(c)\\
&= (1 - A(w))^2 A(w) \sum_{c \in C_{k-1}(w)} \mu(c)\\
&= (1 - A(w))^2 A(w)^{k-1} \mu(w).
\end{aligned}
\tag{6.2.10}
$$
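As a consistency check on the geometric form $\tilde\mu_k(w) = (1 - A(w))^2 A(w)^{k-1}\mu(w)$, one can sum over $k$ (assuming $A(w) \lt 1$). Every clump is a $k$-clump for exactly one $k$, and every occurrence belongs to exactly one clump, so the two sums below should return the clump probability of (6.2.7) and the occurrence probability $\mu(w)$ respectively:

```latex
\begin{align*}
\sum_{k \ge 1} \tilde\mu_k(w)
  &= (1 - A(w))^2 \mu(w) \sum_{k \ge 1} A(w)^{k-1}
   = (1 - A(w))\,\mu(w) = \tilde\mu(w), \\
\sum_{k \ge 1} k\,\tilde\mu_k(w)
  &= (1 - A(w))^2 \mu(w) \sum_{k \ge 1} k\,A(w)^{k-1}
   = \mu(w).
\end{align*}
```

In other words, given that a clump starts at a position, its size is geometrically distributed with parameter $1 - A(w)$.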


As for the declumped count, the number of $k$-clumps of $w$ in the finite
sequence may be different from the sum $\tilde N^{(k)}_{\inf}(w) = \sum_{i=1}^{n-\ell+1} \tilde Y_{i,k}(w)$ because of
possible end effects. The probability that these counts are not equal can be
explicitly bounded, see (6.4.10), (6.4.11) below. Moreover, possible end effects
may lead to a difference between the count $N(w)$ and $\sum_{k > 0} k\,\tilde N^{(k)}(w)$, but this
can also be controlled.


Renewal and renewal count A renewal of $w$ starts at position $i$ in $X_1 \cdots X_n$
if and only if there is an occurrence of $w$ starting at position $i$ that either is the
first one or does not overlap a previous renewal of $w$. Let $I_i(w)$ be the associated
random indicator:

$$
\begin{aligned}
I_i(w) &= \mathbb{I}\{\text{a renewal of } w \text{ starts at position } i \text{ in } X_1 \cdots X_n\}\\
&= Y_i(w) \prod_{j=i-\ell+1}^{i-1} (1 - I_j(w))
\end{aligned}
\tag{6.2.11}
$$

with the convention that $I_j(w) = 0$ if $j \le 0$. Thus, for $i \le \ell$, a renewal
occurrence of $w$ at position $i$ is exactly a clump occurrence of $w$ at $i$ in the finite
sequence. The renewal count makes extensive use of the linear ordering in the
sequence: it is defined by $R(w) = R_n(w) = \sum_{i=1}^{n-\ell+1} I_i(w)$.


6.3. Word locations along a sequence 


Here we are concerned with the length of the gaps between word occurrences. 
First we describe how to obtain the exact distribution of the distance between 
successive occurrences of a word, and then we give asymptotic results. 


6.3.1. Exact distribution of the distance between word occurrences 


Let $w = w_1 \cdots w_\ell$ be a word of length $\ell$ on a finite alphabet $A$. We assume
that $X_1 \cdots X_n$ is a stationary first-order Markov chain on $A$ with transition
matrix $\Pi = (\pi(a,b))_{a,b \in A}$ and stationary distribution $(\mu(a))_{a \in A}$. Here we are
interested in the statistical distribution of the distance $D$ between two successive
occurrences of $w$ and more precisely in the probabilities

$$
\begin{aligned}
f(d) &= P(D = d)\\
&= P(w \text{ occurs at } i+d \text{ and there is no occurrence of } w
\text{ between } i+1 \text{ and } i+d-1 \mid w \text{ occurs at } i), \quad d \ge 1.
\end{aligned}
$$

In this section, we say that a word $w$ occurs at position $i$ if an occurrence of $w$
ends at position $i$; it happens with probability $\mu(w)$ given in (6.2.4).

The probability $f(d)$ can be obtained via a recursive formula as follows. It
is clear that, if $1 \le d \le \ell - 1$ and $d \notin P(w)$, then $f(d) = 0$. If $d \in P(w)$ or if
$d \ge \ell$ then we decompose the event

$$E = \{w \text{ occurs at } i+d\}$$

into the disjoint events

$$E_1 = \{w \text{ occurs at } i+d \text{ and there is no occurrence of } w \text{ between } i+1 \text{ and } i+d-1\}$$

and

$$E_2 = \{w \text{ occurs at } i+d \text{ and there are some occurrences of } w \text{ between } i+1 \text{ and } i+d-1\}.$$

Thus $\{E_1 \mid w \text{ at } i\}$ has probability $f(d)$. Moreover $E_2$ is itself decomposed as
$E_2 = \bigcup_{h=1}^{d-1} E_2(h)$, where

$$E_2(h) = \{\text{there is no occurrence of } w \text{ between } i+1 \text{ and } i+h-1,\ w \text{ occurs at } i+h \text{ and } i+d\}$$

are again disjoint events.

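If $1 \le d \le \ell - 1$ and $d \in P(w)$, then $P(E \mid w \text{ at } i) = \mu(w)/\mu(w^{(\ell-d)})$.
Moreover, if there are occurrences at positions $i+h$ and $i+d$, for some $h \le d-1$, then
the occurrences necessarily overlap, and this is only possible for $d - h \in P(w)$;
in this case, $P(E_2(h) \mid w \text{ at } i) = f(h)\,\mu(w)/\mu(w^{(\ell-d+h)})$. Thus, we have

$$f(d) = \frac{\mu(w)}{\mu(w^{(\ell-d)})} - \sum_{\substack{1 \le h \le d-1\\ d-h \in P(w)}} f(h)\, \frac{\mu(w)}{\mu(w^{(\ell-d+h)})}.$$

If $d \ge \ell$, then $P(E \mid w \text{ at } i) = \Pi^{d-\ell+1}(w_\ell, w_1)\,\mu(w)/\mu(w_1)$. If there is an
occurrence at positions $i+h$ and $i+d$, for some $h \le d-1$, then we distinguish
two cases depending on the possible overlap between the occurrences at $i+h$
and $i+d$: if $d-\ell+1 \le h \le d-1$, they overlap and we use the previous
calculation; if $1 \le h \le d-\ell$, they do not overlap and
$P(E_2(h) \mid w \text{ at } i) = f(h)\,\Pi^{d-h-\ell+1}(w_\ell, w_1)\,\mu(w)/\mu(w_1)$. Thus, from

$$P(E \mid w \text{ at } i) = P(E_1 \mid w \text{ at } i) + \sum_{h=1}^{d-1} P(E_2(h) \mid w \text{ at } i)$$

we get

$$
\Pi^{d-\ell+1}(w_\ell, w_1)\frac{\mu(w)}{\mu(w_1)}
= f(d) + \sum_{1 \le h \le d-\ell} f(h)\,\Pi^{d-h-\ell+1}(w_\ell, w_1)\frac{\mu(w)}{\mu(w_1)}
+ \sum_{\substack{d-\ell+1 \le h \le d-1\\ d-h \in P(w)}} f(h)\,\frac{\mu(w)}{\mu(w^{(\ell-d+h)})}.
$$

This is the proof of the next theorem.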


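THEOREM 6.3.1. The distribution $f(d) = P(D = d)$ of the distance $D$ between
two successive occurrences of a word $w$ in a Markov chain is given by the
following recursive formula:

If $1 \le d \le \ell - 1$ and $d \notin P(w)$, then $f(d) = 0$.
If $1 \le d \le \ell - 1$ and $d \in P(w)$,

$$f(d) = \frac{\mu(w)}{\mu(w^{(\ell-d)})} - \sum_{\substack{1 \le h \le d-1\\ d-h \in P(w)}} f(h)\, \frac{\mu(w)}{\mu(w^{(\ell-d+h)})}.$$

If $d \ge \ell$,

$$
f(d) = \Pi^{d-\ell+1}(w_\ell, w_1)\frac{\mu(w)}{\mu(w_1)}
- \sum_{1 \le h \le d-\ell} f(h)\,\Pi^{d-h-\ell+1}(w_\ell, w_1)\frac{\mu(w)}{\mu(w_1)}
- \sum_{\substack{d-\ell+1 \le h \le d-1\\ d-h \in P(w)}} f(h)\,\frac{\mu(w)}{\mu(w^{(\ell-d+h)})}.
$$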
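The recursion can be implemented directly. The sketch below reads Theorem 6.3.1 as: $f(d) = 0$ for non-period $d \le \ell-1$; $f(d) = \mu(w)/\mu(w^{(\ell-d)}) - \sum f(h)\mu(w)/\mu(w^{(\ell-d+h)})$ over $d-h \in P(w)$ for period $d \le \ell-1$; and, for $d \ge \ell$, the term $\Pi^{d-\ell+1}(w_\ell,w_1)\mu(w)/\mu(w_1)$ minus the non-overlapping and overlapping corrections. The two-letter chain and all names are assumptions made for illustration; exact rational arithmetic is used so that the values can be checked by hand.

```python
from fractions import Fraction as F


def periods(w):
    n = len(w)
    return [p for p in range(1, n) if all(w[i] == w[i + p] for i in range(n - p))]


def word_prob(w, mu, pi):
    prob = mu[w[0]]
    for a, b in zip(w, w[1:]):
        prob *= pi[a][b]
    return prob


def distance_pmf(w, mu, pi, dmax):
    """Return f with f[d] = P(D = d) for 1 <= d <= dmax (f[0] is unused)."""
    L, P = len(w), set(periods(w))
    muw = word_prob(w, mu, pi)
    # ratio[j] = mu(w) / mu(w^(j)) for the prefix of length j
    ratio = {j: muw / word_prob(w[:j], mu, pi) for j in range(1, L + 1)}
    # hop[u] = Pi^u(w_l, w_1), obtained by repeated vector-matrix products
    row = {a: (1 if a == w[-1] else 0) for a in mu}
    hop = [row[w[0]]]
    for _ in range(dmax):
        row = {b: sum(row[a] * pi[a][b] for a in mu) for b in mu}
        hop.append(row[w[0]])
    f = [0] * (dmax + 1)
    for d in range(1, dmax + 1):
        if d <= L - 1:
            if d not in P:
                continue                      # f(d) = 0
            val = ratio[L - d]
            lo = 1                            # every earlier occurrence overlaps
        else:
            val = hop[d - L + 1] * ratio[1]
            for h in range(1, d - L + 1):     # non-overlapping earlier occurrence
                val -= f[h] * hop[d - L - h + 1] * ratio[1]
            lo = d - L + 1
        for h in range(lo, d):                # overlapping earlier occurrence
            if d - h in P:
                val -= f[h] * ratio[L - d + h]
        f[d] = val
    return f


# Hypothetical chain on {a, c} with its stationary distribution.
pi = {"a": {"a": F(3, 4), "c": F(1, 4)}, "c": {"a": F(1, 2), "c": F(1, 2)}}
mu = {"a": F(2, 3), "c": F(1, 3)}
f = distance_pmf("aaa", mu, pi, 120)
print(f[1], f[2], f[3], f[4])
```

For $w = aaa$ the first values can be verified by inspection: $f(1) = \pi(a,a)$, $f(2) = f(3) = 0$, and $f(4) = \pi(a,c)\pi(c,a)\pi(a,a)^2$. The truncated sums $\sum_d f(d)$ and $\sum_d d\,f(d)$ should be close to 1 and to $1/\mu(w)$.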
Since $D$ is the distance between two successive occurrences of $w$, note that,
even if $d \in P(w)$, $f(d)$ can be null. For instance, by taking $w = aaa$, we
have $P(aaa) = \{1, 2\}$, and $f(1) = \mu(aaa)/\mu(aa) = \pi(a,a)$,
$f(2) = \pi(a,a)^2 - f(1)\pi(a,a) = 0$.

Note that the recurrence formula on $f(d)$ is not a "finite" recurrence since
calculating $f(d)$ requires the calculation of $f(d-1), \dots, f(1)$, involving
substantial numerical calculations for large $d$. One can approach this
computational problem by using the generating function defined by
$\Phi_D(t) := E(t^D) = \sum_{d \ge 1} f(d)t^d$. The key argument is that $\Phi_D(t)$ is a rational
function of the form $P(t)/Q(t)$, and hence the coefficient $f(d)$ of $t^d$ can be expressed
by a recurrence formula whose order is the degree of the polynomial $Q(t)$ (see
Section 6.8.4).


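THEOREM 6.3.2. The generating function of $D$ is

$$
\Phi_D(t) = 1 - \mu^{-1}(w) \left( \sum_{u \in P(w) \cup \{0\}} \frac{t^u}{\mu(w^{(\ell-u)})}
+ \frac{1}{\mu(w_1)} \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^{\ell+u-1} \right)^{-1}.
$$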


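REMARK 6.3.3. If the transition matrix $\Pi$ is diagonalizable, there exist $\alpha_i$,
$\beta_i \in \mathbb{C}$, $i = 2, \dots, |A|$, such that

$$
\frac{1}{\mu(w_1)} \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^{\ell+u-1}
= \frac{t^\ell}{1-t} + \sum_{i=2}^{|A|} \beta_i\, \frac{t^\ell}{1 - \alpha_i t},
$$

implying that the above expression is a rational function with a pole at $t = 1$.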


REMARK 6.3.4. Since $\Phi_D(t) = \sum_{d \ge 1} f(d)t^d$, we have the following general
properties:

$$
\begin{aligned}
E(D) &= \Phi_D'(1) = \mu^{-1}(w),\\
\mathrm{Var}(D) &= \Phi_D''(1) + \Phi_D'(1)\,(1 - \Phi_D'(1)).
\end{aligned}
$$

Successive derivatives of $\Phi_D(t)$ are obtained using the decomposition stated in
the previous remark.


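Proof

The proof of Theorem 6.3.2 is not complicated, since one just has to expand
the sum $\sum_{d \ge 1} f(d)t^d$ with $f(d)$ given by Theorem 6.3.1, but it is very technical.
We thus only give the main lines of the calculation. By replacing $f(d)$ given by
Theorem 6.3.1 in $\sum_{d \ge 1} f(d)t^d$, we obtain a sum of five terms

$$\Phi_D(t) = K_1 - K_2 + K_3 - K_4 - K_5$$

with

$$
\begin{aligned}
K_1 &= \sum_{\substack{d=1\\ d \in P(w)}}^{\ell-1} \frac{\mu(w)}{\mu(w^{(\ell-d)})}\, t^d,\\
K_2 &= \sum_{\substack{d=1\\ d \in P(w)}}^{\ell-1} \sum_{\substack{h=1\\ d-h \in P(w)}}^{d-1} f(h)\, \frac{\mu(w)}{\mu(w^{(\ell-d+h)})}\, t^d,\\
K_3 &= \sum_{d \ge \ell} \frac{\mu(w)}{\mu(w_1)}\, \Pi^{d-\ell+1}(w_\ell, w_1)\, t^d
= \frac{\mu(w)}{\mu(w_1)}\, t^{\ell-1} \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^u,\\
K_4 &= \sum_{d \ge \ell} \sum_{h=1}^{d-\ell} f(h)\, \frac{\mu(w)}{\mu(w_1)}\, \Pi^{d-\ell-h+1}(w_\ell, w_1)\, t^d\\
&= \frac{\mu(w)}{\mu(w_1)}\, t^{\ell-1} \sum_{h \ge 1} f(h)\, t^h \sum_{z > h} \Pi^{z-h}(w_\ell, w_1)\, t^{z-h}\\
&= \frac{\mu(w)}{\mu(w_1)}\, t^{\ell-1}\, \Phi_D(t) \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^u
\end{aligned}
$$

and

$$
\begin{aligned}
K_5 &= \sum_{d \ge \ell} \sum_{\substack{h=d-\ell+1\\ d-h \in P(w)}}^{d-1} f(h)\, \frac{\mu(w)}{\mu(w^{(\ell-d+h)})}\, t^d\\
&= \sum_{h=1}^{\ell-1} f(h) \sum_{\substack{z=1\\ z+\ell-h-1 \in P(w)}}^{h} \frac{\mu(w)}{\mu(w^{(h-z+1)})}\, t^{z+\ell-1}
+ \sum_{h \ge \ell} f(h) \sum_{\substack{z=h-\ell+2\\ z+\ell-h-1 \in P(w)}}^{h} \frac{\mu(w)}{\mu(w^{(h-z+1)})}\, t^{z+\ell-1}\\
&= \sum_{h=1}^{\ell-1} f(h)\, t^h \sum_{\substack{u=\ell-h\\ u \in P(w)}}^{\ell-1} \frac{\mu(w)}{\mu(w^{(\ell-u)})}\, t^u
+ \sum_{h \ge \ell} f(h)\, t^h \sum_{\substack{u=1\\ u \in P(w)}}^{\ell-1} \frac{\mu(w)}{\mu(w^{(\ell-u)})}\, t^u.
\end{aligned}
$$

Grouping $K_1 - K_2 - K_5$ and $K_3 - K_4$ leads to

$$
\Phi_D(t) = (1 - \Phi_D(t)) \left( \sum_{\substack{u=1\\ u \in P(w)}}^{\ell-1} \frac{\mu(w)}{\mu(w^{(\ell-u)})}\, t^u
+ \frac{\mu(w)}{\mu(w_1)}\, t^{\ell-1} \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^u \right),
$$

hence

$$
\Phi_D(t) = 1 - \left( 1 + \sum_{\substack{u=1\\ u \in P(w)}}^{\ell-1} \frac{\mu(w)}{\mu(w^{(\ell-u)})}\, t^u
+ \frac{\mu(w)}{\mu(w_1)}\, t^{\ell-1} \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^u \right)^{-1}.
$$

Using $\mu(w)/\mu(w^{(\ell)}) = 1$ establishes the theorem.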


The distance $D$ between two successive occurrences of $w$ can be seen as
the distance between the $j$-th and $(j+1)$-th occurrence of $w$ in the sequence,
since we use a homogeneous model. It may be useful to study the distance
$D^{(r)}$ between the $j$-th and $(j+r)$-th occurrence of $w$, the so-called $r$-scan. The
distance $D^{(r)}$ is the sum of $r$ independent and identically distributed random
variables with the same distribution as $D$. Hence we have

$$\Phi_{D^{(r)}}(t) = (\Phi_D(t))^r.$$

We obtain the exact distribution of $D^{(r)}$ from the Taylor expansion of $\Phi_{D^{(r)}}(t)$:
the probability $P(D^{(r)} = d)$ is the coefficient of $t^d$ in the series.
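Since the generating function of an $r$-scan is the $r$-th power of that of $D$, its coefficients can be obtained by $r$ successive convolutions of the distribution of $D$. A minimal sketch, with hypothetical function names and a toy distribution for $D$ (not one derived from a word):

```python
def convolve(p, q, dmax):
    """Coefficients up to degree dmax of the product of two power series p and q."""
    out = [0.0] * (dmax + 1)
    for i, pi_ in enumerate(p):
        if pi_ == 0:
            continue
        for j, qj in enumerate(q):
            if i + j > dmax:
                break
            out[i + j] += pi_ * qj
    return out


def r_scan_pmf(f, r, dmax):
    """P(D^(r) = d) for d = 0..dmax, where f[d] = P(D = d) and D^(r) is a sum of r i.i.d. copies."""
    out = [1.0] + [0.0] * dmax        # generating function 1 (empty sum)
    for _ in range(r):
        out = convolve(out, f, dmax)  # multiply by the generating function of D
    return out


f_toy = [0.0, 0.5, 0.5]               # toy example: D equals 1 or 2 with probability 1/2
print(r_scan_pmf(f_toy, 2, 4))
```

Truncating at `dmax` is harmless here because the coefficient of $t^d$ in a product only involves coefficients of degree at most $d$.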


6.3.2. Asymptotic distribution of r-scans 


In the preceding paragraph, we presented how to obtain the exact distribution of
an $r$-scan $D^{(r)}$, the distance between a word occurrence and the $r$-th next
one, in a stationary Markov chain of first order. Often one is interested in the
occurrence of any element of a subset of words; such a subset is called a motif.
When analyzing a biological sequence, assume we observe $(h+1)$ occurrences of
a given motif, so that we observe $h$ distances $D_1, \dots, D_h$ between occurrences of
the motif. Thus we observe $(h-r+1)$ so-called $r$-scans $D_i^{(r)} = \sum_{j=i}^{i+r-1} D_j$. To
detect poor and rich regions with this motif, one is interested in studying the
significance of the smallest and the largest $r$-scans, or more generally the $k$th
smallest $r$-scan, denoted by $m_k$, and the $k$th largest $r$-scan, denoted by $M_k$. In
this section, we present a Poisson approximation for the statistical distribution
of the extreme value $m_k$ using the Chen-Stein method. A similar result is
available for $M_k$, by following an identical setup, so it will not be explained in
detail here.

We begin by defining the Bernoulli variables that will be used in the
Chen-Stein method (see Section 6.8.2):

$$W_i^-(d) := \mathbb{I}\{D_i^{(r)} \le d\}, \quad d \ge 0.$$

Denote by

$$W^-(d) = \sum_{i=1}^{h-r+1} W_i^-(d)$$

the number of $r$-scans less than or equal to $d$. Note the duality principle

$$\{W^-(d) < k\} = \{m_k > d\}, \quad d \ge 0.$$
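The duality between the count of small $r$-scans and the $k$th smallest $r$-scan is easy to check numerically: fewer than $k$ of the $r$-scans are at most $d$ exactly when the $k$th smallest one exceeds $d$. The distances below are made-up values used only for illustration, and the function names are hypothetical.

```python
def r_scans(distances, r):
    """All sums of r consecutive distances: D_i^(r) = D_i + ... + D_{i+r-1}."""
    return [sum(distances[i:i + r]) for i in range(len(distances) - r + 1)]


def count_small(scans, d):
    """Number of r-scans that are less than or equal to d."""
    return sum(1 for s in scans if s <= d)


def kth_smallest(scans, k):
    """The k-th smallest r-scan m_k (k is 1-based)."""
    return sorted(scans)[k - 1]


distances = [5, 1, 2, 9, 3, 1, 1]     # hypothetical observed distances, h = 7
scans = r_scans(distances, 3)         # h - r + 1 = 5 three-scans
print(scans, count_small(scans, 8), kth_smallest(scans, 3))
```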


We now use Theorem 6.8.2 to get a Poisson approximation for the distribution
of $W^-(d)$. To apply this theorem, we first need to choose a neighborhood
of dependence for each indicator variable; ideally the indicator variables with
indices not from the neighborhood of dependence are independent of that
indicator variable. Secondly there are three quantities to bound, called $b_1$, $b_2$, and
$b_3$, given in (6.8.1), (6.8.2), and (6.8.3). Piecing this together gives a bound
on the total variation distance between the distributions. Here we proceed as
follows.

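For $i \in \{1, \dots, h-r+1\}$, we choose the neighborhood $B_i = \{j : |i - j| < r\}$,
so that $D_i^{(r)}$ is independent of $D_j^{(r)}$ if $j \notin B_i$ (recall the distances $D_1, \dots, D_h$
are independent). Let $Z_{\lambda^-}$ be the Poisson variable with expectation $\lambda^-$, where

$$
\begin{aligned}
\lambda^- &= E(W^-(d))\\
&= (h-r+1)\,E(W_i^-(d))\\
&= (h-r+1)\,P(D^{(r)} \le d).
\end{aligned}
$$

Theorem 6.8.2 gives that

$$
d_{TV}\big(\mathcal{L}(W^-(d)), \mathcal{L}(Z_{\lambda^-})\big) \le \frac{1 - e^{-\lambda^-}}{\lambda^-}
\left[ \sum_{i=1}^{h-r+1} \sum_{j \in B_i} E(W_i^-(d))\,E(W_j^-(d))
+ \sum_{i=1}^{h-r+1} \sum_{j \in B_i \setminus \{i\}} E(W_i^-(d)\,W_j^-(d)) \right].
$$

Indeed the neighborhood $B_i$ is chosen so that $W_i^-(d)$ is independent of $W_j^-(d)$,
$\forall j \notin B_i$, leading to $b_3 = 0$. For $j > i$, we have

$$
\begin{aligned}
E(W_i^-(d)\,W_j^-(d)) &= P(D_i^{(r)} \le d,\ D_j^{(r)} \le d)\\
&= P(D_j^{(r)} \le d \mid D_i^{(r)} \le d)\,P(D_i^{(r)} \le d)\\
&= P(D_{j-i+1}^{(r)} \le d \mid D_1^{(r)} \le d)\,P(D^{(r)} \le d).
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
\sum_{i=1}^{h-r+1} \sum_{j \in B_i \setminus \{i\}} E(W_i^-(d)\,W_j^-(d))
&\le 2(h-r+1)\,P(D^{(r)} \le d) \sum_{s=2}^{r} P(D_s^{(r)} \le d \mid D_1^{(r)} \le d)\\
&\le 2\lambda^- \sum_{s=2}^{r} P(D_s^{(r)} \le d \mid D_1^{(r)} \le d).
\end{aligned}
$$

It can be shown that

$$
P(D_s^{(r)} \le d \mid D_1^{(r)} \le d) \le P\Big( \sum_{i=r+1}^{s+r-1} D_i \le d \Big) = P(D^{(s-1)} \le d).
$$

We finally get

$$
d_{TV}\big(\mathcal{L}(W^-(d)), \mathcal{L}(Z_{\lambda^-})\big) \le
\Big( (2r-1)\,P(D^{(r)} \le d) + 2 \sum_{s=1}^{r-1} P(D^{(s)} \le d) \Big) \big(1 - e^{-\lambda^-}\big).
$$

From the duality principle,

$$
\big| P(m_k > d) - P(Z_{\lambda^-} < k) \big| \le
\Big( (2r-1)\,P(D^{(r)} \le d) + 2 \sum_{s=1}^{r-1} P(D^{(s)} \le d) \Big) \big(1 - e^{-\lambda^-}\big).
$$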

This approximation is very useful for the comparison between the expected 
distribution of the r-scans and the one observed in the biological sequence. 


6.4. Word count distribution 


Let again $w = w_1 \cdots w_\ell$ be a word of length $\ell$ on a finite alphabet $A$ and
$X = (X_i)_{i \in \mathbb{Z}}$ be a random sequence on $A$. This section is devoted to the
statistical distribution of the count $N(w)$ of $w$ in the sequence $X_1 \cdots X_n$. First we
state how to compute the exact distribution in the model M1, using recursion
techniques. For long sequences, however, asymptotic results are obtainable, and,
in general, easier to handle. Here the appropriate asymptotic regime depends
crucially on the length $\ell$ of the target word relative to the sequence length $n$.
For very short words, the law of large numbers can be applied to approximate
the word count by the expected word count. This being a very crude estimate,
one can easily improve on it by employing the Central Limit Theorem, stating
that the word count distribution is asymptotically normal. This approximation
will be satisfactory when the words are not too long. For rare words, as a rule
of thumb words of length $\ell$ of order $\log n$, a compound Poisson approximation will
give better results. For the latter, the error made in the approximation can be
bounded in terms of the sequence length, the word length, and word
probabilities, so that it is possible to assess when a compound Poisson approximation
will be a good choice. Moreover, the error bound can be incorporated to give
conservative confidence intervals, as will be explained below.
6.4.1. Exact distribution


If $X$ is a stationary first-order Markov chain, the exact distribution of the count
$N(w)$ can be easily obtained using the distribution of the successive positions
$(T_j)_{j \ge 1}$ of the $j$-th occurrence of $w$ in $X_1 \cdots X_n$, using the duality principle

$$\{N(w) \ge j\} = \{T_j \le n\}.$$

The exact distribution of $T_j$ can be obtained as in Section 6.3.1, by deriving
the Taylor expansion of the generating function $\Phi_{T_j}(t)$ of $T_j$. If $j = 1$, the
generating function $\Phi_{T_1}(t)$ can be obtained as $\Phi_D(t)$ (see Theorem 6.3.2). We
just state the result:

$$
\Phi_{T_1}(t) = \frac{t^\ell}{1-t} \left( \sum_{u \in P(w) \cup \{0\}} \frac{t^u}{\mu(w^{(\ell-u)})}
+ \frac{1}{\mu(w_1)} \sum_{u \ge 1} \Pi^u(w_\ell, w_1)\, t^{\ell+u-1} \right)^{-1}.
$$

As $T_j - T_1$ is a sum of $j - 1$ independent and identically distributed random
variables with the same distribution as $D$, we have $\Phi_{T_j}(t) = \Phi_{T_1}(t)\,(\Phi_D(t))^{j-1}$.
Now $P(T_j = a) = g_j(a)$ is equal to the coefficient of $t^a$ in the Taylor expansion
of $\Phi_{T_j}(t)$. Using the duality principle, we obtain

$$P(N(w) = j) = \sum_{a \le n} \big( g_j(a) - g_{j+1}(a) \big).$$


6.4.2. The weak law of large numbers 


As a crude first approximation, the weak law of large numbers states that the 
observed counts will indeed converge towards the expected counts. Indeed we 
may use Chebyshev’s inequality to bound the expected deviation of the ob- 
served counts from the expected number of occurrences. This approximation is 
valid only for relatively short words, and in this case a normal approximation 
gives more information. Such an approximation will be derived in the following 
subsection. 


6.4.3. Asymptotic distribution: the Gaussian regime 


We assume that $X = (X_i)_{i \in \mathbb{Z}}$ is a stationary $m$-order Markov chain on $A$,
$0 \le m \le \ell - 2$, with transition probabilities $\pi(a_1 \cdots a_m, a_{m+1})$ and stationary
distribution $\mu(a_1 \cdots a_m)$, $a_1, \dots, a_{m+1} \in A$. For convenience in this particular
subsection, we consider $N(w) = \sum_{i=\ell}^{n} Y_i(w)$ and

$$Y_i = Y_i(w) = \mathbb{I}\{w \text{ ends at position } i \text{ in } X\}.$$

If the model is known, the asymptotic normality of $(N(w) - E(N(w)))/\sqrt{n}$
directly follows from a Central Limit Theorem for Markov chains. When
$m = 1$, the expectation and variance of $N(w)$ are

$$
\begin{aligned}
E(N(w)) &= (n - \ell + 1)\,\mu_1(w)\\
\mathrm{Var}(N(w)) &= E(N(w)) + 2 \sum_{p \in P(w)} E(N(w^{(p)}w)) - E(N(w))^2\\
&\quad + \frac{2\,\mu_1(w)^2}{\mu(w_1)} \sum_{d=1}^{n-2\ell+1} (n - 2\ell + 2 - d)\, \Pi^d(w_\ell, w_1)
\end{aligned}
\tag{6.4.1}
$$

where $\mu_1(w)$ is given in Eq. (6.2.4).
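The variance formula for $m = 1$, read as $\mathrm{Var}(N(w)) = E(N(w)) + 2\sum_{p \in P(w)} E(N(w^{(p)}w)) - E(N(w))^2 + \frac{2\mu_1(w)^2}{\mu(w_1)}\sum_{d=1}^{n-2\ell+1}(n-2\ell+2-d)\,\Pi^d(w_\ell,w_1)$, can be checked against an exhaustive enumeration on a short sequence. The sketch below does this for a hypothetical two-letter chain; the names and parameter values are assumptions, and the enumeration is only feasible for very small $n$.

```python
from itertools import product


def periods(w):
    n = len(w)
    return [p for p in range(1, n) if all(w[i] == w[i + p] for i in range(n - p))]


def word_prob(w, mu, pi):
    prob = mu[w[0]]
    for a, b in zip(w, w[1:]):
        prob *= pi[a][b]
    return prob


def count_moments(w, n, mu, pi):
    """Mean and variance of N(w) in X_1..X_n from the closed formula (first-order chain)."""
    L, mw = len(w), word_prob(w, mu, pi)
    mean = (n - L + 1) * mw
    var = mean - mean ** 2
    for p in periods(w):                       # overlapping pairs, p positions apart
        var += 2 * max(n - L - p + 1, 0) * word_prob(w[:p] + w, mu, pi)
    row = {a: 1.0 if a == w[-1] else 0.0 for a in mu}
    for d in range(1, n - 2 * L + 2):          # non-overlapping pairs, gap of d steps
        row = {b: sum(row[a] * pi[a][b] for a in mu) for b in mu}   # row of Pi^d
        var += 2 * mw ** 2 / mu[w[0]] * (n - 2 * L + 2 - d) * row[w[0]]
    return mean, var


def brute_force_moments(w, n, mu, pi):
    """Same moments by summing over all |A|^n sequences."""
    L, m1, m2 = len(w), 0.0, 0.0
    for letters in product(sorted(mu), repeat=n):
        s = "".join(letters)
        pr = word_prob(s, mu, pi)
        c = sum(s[i:i + L] == w for i in range(n - L + 1))
        m1 += pr * c
        m2 += pr * c * c
    return m1, m2 - m1 ** 2


# Hypothetical chain on {a, c} with its stationary distribution.
pi = {"a": {"a": 0.75, "c": 0.25}, "c": {"a": 0.5, "c": 0.5}}
mu = {"a": 2 / 3, "c": 1 / 3}
print(count_moments("aa", 8, mu, pi), brute_force_moments("aa", 8, mu, pi))
```

The comparison covers both a self-overlapping word with period 1 (aa) and one with period 2 (aca).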

In the problem of finding exceptional words in biological sequences, the
model is unknown and its parameters are estimated from the observed sequence.
The expected mean of $N(w)$ is not available and is approximated by an
estimator $\hat N_m(w)$. In this paragraph, we derive both the asymptotic normality of
$(N(w) - \hat N_m(w))/\sqrt{n}$ and the asymptotic variance. This is not a trivial problem
since the estimation changes the variance expression fundamentally.

The expected mean of $N(w)$ is given by $E(N(w)) = (n - \ell + 1)\mu(w)$ where
$\mu(w) = \mu_m(w)$ is the probability that an occurrence of $w$ ends at a given position
in the sequence (see Eq. (6.2.4)). Estimating each parameter by its maximum
likelihood estimator (with the simplification from Remark 6.1.1) gives an
estimator $\hat N_m(w)$ of $E(N(w))$:

$$
\hat N_m(w) = \frac{N(w_1 \cdots w_{m+1}) \cdots N(w_{\ell-m} \cdots w_\ell)}
{N(w_2 \cdots w_{m+1}) \cdots N(w_{\ell-m} \cdots w_{\ell-1})}. \tag{6.4.2}
$$


Maximal model Let us first consider the maximal model ($m = \ell - 2$), which
is mainly used to find exceptional words. To shorten the formulas, we introduce
the notation

$$
\begin{aligned}
w^- &= w_1 \cdots w_{\ell-1} && \text{first } \ell-1 \text{ letters of } w\\
{}^-w &= w_2 \cdots w_\ell && \text{last } \ell-1 \text{ letters of } w\\
{}^-w^- &= w_2 \cdots w_{\ell-1} && \ell-2 \text{ central letters of } w.
\end{aligned}
$$

Under the maximal model, the estimator of $N(w)$ is

$$
\hat N_{\ell-2}(w) = \frac{N(w_1 \cdots w_{\ell-1})\,N(w_2 \cdots w_\ell)}{N(w_2 \cdots w_{\ell-1})}
= \frac{N(w^-)\,N({}^-w)}{N({}^-w^-)};
$$

moreover, the asymptotic normality of $(N(w) - \hat N_{\ell-2}(w))/\sqrt{n}$ and the
asymptotic variance can be obtained in an elegant way using martingale techniques.
Indeed, $\hat N_{\ell-2}(w)$ is a natural estimator of $N(w^-)\,\pi({}^-w^-, w_\ell)$, and
$N(w) - N(w^-)\,\pi({}^-w^-, w_\ell)$ is approximately a martingale, as shown below.

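We introduce the martingale $M_n = \sum_{i=\ell}^{n} (Y_i - E(Y_i \mid \mathcal{F}_{i-1}))$ with
$\mathcal{F}_i = \sigma(X_1, \dots, X_i)$; it is easy to verify that $E(M_n \mid \mathcal{F}_{n-1}) = M_{n-1}$. Moreover, we
have

$$
\begin{aligned}
E(Y_i \mid \mathcal{F}_{i-1}) &= P(w^- \text{ ends at } i-1 \text{ and } w_\ell \text{ occurs at } i \mid \mathcal{F}_{i-1})\\
&= \mathbb{I}\{w^- \text{ ends at } i-1\}\, \pi({}^-w^-, w_\ell),
\end{aligned}
$$

and

$$\sum_{i=\ell}^{n} E(Y_i \mid \mathcal{F}_{i-1}) = \big( N(w^-) - \mathbb{I}\{w^- \text{ ends at } n\} \big)\, \pi({}^-w^-, w_\ell).$$

Therefore,

$$
\frac{1}{\sqrt{n}} M_n = \frac{1}{\sqrt{n}} \big( N(w) - N(w^-)\,\pi({}^-w^-, w_\ell) \big)
+ \frac{1}{\sqrt{n}}\, \mathbb{I}\{w^- \text{ ends at } n\}\, \pi({}^-w^-, w_\ell). \tag{6.4.3}
$$

Note that $n^{-1/2}\, \mathbb{I}\{w^- \text{ ends at } n\}\, \pi({}^-w^-, w_\ell)$ tends to zero as $n \to \infty$. The next
proposition establishes the asymptotic normality of $M_n/\sqrt{n}$.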


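PROPOSITION 6.4.1. Let $V = \mu(w^-)\,\pi({}^-w^-, w_\ell)\,(1 - \pi({}^-w^-, w_\ell))$. We have

$$\frac{1}{\sqrt{n}} M_n \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, V) \quad \text{as } n \to \infty.$$

Proof

This is an application of Theorem 6.8.7 for the one-dimensional random
variable $\xi_{n,i} = n^{-1/2}(Y_i - E(Y_i \mid \mathcal{F}_{i-1}))$. Three conditions have to be
satisfied. Condition (i) holds from $E(\xi_{n,i} \mid \mathcal{F}_{i-1}) = 0$. We then have to check that
$\sum_{i=\ell}^{n} \mathrm{Var}(\xi_{n,i} \mid \mathcal{F}_{i-1})$ converges to $V$ as $n \to \infty$. Since $Y_i$ is a 0-1 random
variable, we have

$$
\begin{aligned}
\mathrm{Var}(Y_i \mid \mathcal{F}_{i-1}) &= E(Y_i \mid \mathcal{F}_{i-1}) - (E(Y_i \mid \mathcal{F}_{i-1}))^2\\
&= \mathbb{I}\{w^- \text{ ends at } i-1\}\, \pi({}^-w^-, w_\ell)\,(1 - \pi({}^-w^-, w_\ell)).
\end{aligned}
$$

We thus obtain

$$
\begin{aligned}
\sum_{i=\ell}^{n} \mathrm{Var}(\xi_{n,i} \mid \mathcal{F}_{i-1}) &= \frac{1}{n} \sum_{i=\ell}^{n} \mathrm{Var}(Y_i \mid \mathcal{F}_{i-1})\\
&= \frac{1}{n} N(w^-)\, \pi({}^-w^-, w_\ell)\,(1 - \pi({}^-w^-, w_\ell))\\
&\quad - \frac{1}{n}\, \mathbb{I}\{w^- \text{ ends at } n\}\, \pi({}^-w^-, w_\ell)\,(1 - \pi({}^-w^-, w_\ell))\\
&\longrightarrow V \quad \text{as } n \to \infty;
\end{aligned}
$$

the convergence follows from the Law of Large Numbers: $N(w^-)/n \to \mu(w^-)$.
Finally, $|\xi_{n,i}| \le 2/\sqrt{n}$, so that $\forall \varepsilon > 0$, $\forall n > 4/\varepsilon^2$, $P(|\xi_{n,i}| > \varepsilon) = 0$, establishing
condition (iii). Using Theorem 6.8.7 proves the proposition. □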


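Proposition 6.4.1 and Equation (6.4.3) also yield that

$$\frac{1}{\sqrt{n}} \big( N(w) - N(w^-)\,\pi({}^-w^-, w_\ell) \big) \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, V) \quad \text{as } n \to \infty.$$

We want to prove such convergence for

$$T_n = \frac{1}{\sqrt{n}} \big( N(w) - N(w^-)\,\hat\pi({}^-w^-, w_\ell) \big),
\quad \text{where } \hat\pi({}^-w^-, w_\ell) = \frac{N({}^-w)}{N({}^-w^-)}.$$

To this purpose, we decompose $T_n$ as follows:

$$
\begin{aligned}
T_n &= \frac{1}{\sqrt{n}} \big( N(w) - N(w^-)\,\pi({}^-w^-, w_\ell) \big)
- \frac{1}{\sqrt{n}} N(w^-) \big( \hat\pi({}^-w^-, w_\ell) - \pi({}^-w^-, w_\ell) \big)\\
&= \frac{1}{\sqrt{n}} \Big( M_n - \frac{N(w^-)}{N({}^-w^-)} M_n' \Big) + o(1),
\end{aligned}
\tag{6.4.4}
$$

where $M_n'$ is the martingale $M_n' = \sum_{i} (Y_i({}^-w) - E(Y_i({}^-w) \mid \mathcal{F}_{i-1}))$. Now,
using Theorem 6.8.7 gives

$$
\frac{1}{\sqrt{n}} \begin{pmatrix} M_n\\ M_n' \end{pmatrix} \xrightarrow{\ \mathcal{D}\ }
\mathcal{N}\left( 0, \begin{pmatrix} V_{11} & V_{12}\\ V_{21} & V_{22} \end{pmatrix} \right)
\tag{6.4.5}
$$

with $V_{11} = V$,

$$
V_{21} = V_{12} = \lim_{n \to \infty} \frac{1}{n} \sum_{i}
E\big( (Y_i - E(Y_i \mid \mathcal{F}_{i-1}))\,(Y_i({}^-w) - E(Y_i({}^-w) \mid \mathcal{F}_{i-1})) \mid \mathcal{F}_{i-1} \big)
$$

and

$$V_{22} = \lim_{n \to \infty} \frac{1}{n} \sum_{i} \mathrm{Var}(Y_i({}^-w) \mid \mathcal{F}_{i-1}).$$

With the same technique as for the derivation of $V$, as $Y_i\,Y_i({}^-w) = Y_i$, we get
$V_{11} = V_{12} = V$ and $V_{22} = \mu({}^-w^-)\,\pi({}^-w^-, w_\ell)\,(1 - \pi({}^-w^-, w_\ell))$. Note that the
Law of Large Numbers guarantees that, almost surely,

$$\frac{N(w^-)}{N({}^-w^-)} \longrightarrow \frac{\mu(w^-)}{\mu({}^-w^-)}. \tag{6.4.6}$$

From (6.4.4)-(6.4.6), we are now able to deduce that $T_n$ converges in distribution
to $\mathcal{N}(0, \sigma^2_{\ell-2}(w))$ with

$$
\begin{aligned}
\sigma^2_{\ell-2}(w) &= V_{11} - 2\,\frac{\mu(w^-)}{\mu({}^-w^-)}\,V_{12} + \Big( \frac{\mu(w^-)}{\mu({}^-w^-)} \Big)^2 V_{22}\\
&= \mu(w^-) \Big( 1 - \frac{\mu(w^-)}{\mu({}^-w^-)} \Big)\, \pi({}^-w^-, w_\ell)\,(1 - \pi({}^-w^-, w_\ell))\\
&= \frac{\mu(w)}{\mu({}^-w^-)} \big( \mu({}^-w^-) - \mu(w^-) \big)\,(1 - \pi({}^-w^-, w_\ell))\\
&= \frac{\mu(w)}{\mu({}^-w^-)^2} \big( \mu({}^-w^-) - \mu(w^-) \big) \big( \mu({}^-w^-) - \mu({}^-w) \big).
\end{aligned}
$$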

We have just proved the following theorem. 


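THEOREM 6.4.2. As $n \to \infty$, we have

$$\frac{1}{\sqrt{n}} \big( N(w) - \hat N_{\ell-2}(w) \big) \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \sigma^2_{\ell-2}(w))$$

with

$$\sigma^2_{\ell-2}(w) = \frac{\mu(w)}{\mu({}^-w^-)^2} \big( \mu({}^-w^-) - \mu(w^-) \big) \big( \mu({}^-w^-) - \mu({}^-w) \big)$$

and

$$\frac{N(w) - \hat N_{\ell-2}(w)}{\sqrt{n\,\hat\sigma^2_{\ell-2}(w)}} \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, 1)$$

where $n\,\hat\sigma^2_{\ell-2}(w)$ is the plug-in estimator of $n\,\sigma^2_{\ell-2}(w)$:

$$
n\,\hat\sigma^2_{\ell-2}(w) = \frac{\hat N_{\ell-2}(w)}{N({}^-w^-)^2}
\big( N({}^-w^-) - N(w^-) \big) \big( N({}^-w^-) - N({}^-w) \big).
$$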
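In practice the theorem is used to attach a standardized score to each word of an observed sequence. The following sketch computes the estimator $\hat N_{\ell-2}(w) = N(w^-)N({}^-w)/N({}^-w^-)$, the plug-in variance $\hat N_{\ell-2}(w)\,(N({}^-w^-) - N(w^-))(N({}^-w^-) - N({}^-w))/N({}^-w^-)^2$ and the resulting score. The function names and the short toy sequence are assumptions for illustration; all counts are taken over the whole sequence, so small end effects are ignored, and the word must have length at least 3 with a non-degenerate variance.

```python
from math import sqrt


def count(seq, w):
    """Number of overlapping occurrences of w in seq."""
    L = len(w)
    return sum(seq[i:i + L] == w for i in range(len(seq) - L + 1))


def maximal_model_score(seq, w):
    """Observed count, estimated expected count, plug-in variance and z-score of w."""
    n_w = count(seq, w)
    n_left = count(seq, w[:-1])      # N(w^-): first l-1 letters
    n_right = count(seq, w[1:])      # N(^-w): last l-1 letters
    n_mid = count(seq, w[1:-1])      # N(^-w^-): l-2 central letters
    expected = n_left * n_right / n_mid
    variance = expected / n_mid ** 2 * (n_mid - n_left) * (n_mid - n_right)
    return n_w, expected, variance, (n_w - expected) / sqrt(variance)


toy = "acgccgacaacg"                 # hypothetical sequence, far too short for real use
print(maximal_model_score(toy, "acg"))
```

On a real sequence, a large positive or negative score points to a word that is over- or under-represented relative to what its two overlapping subwords of length $\ell-1$ predict; the normal approximation is only meaningful for long sequences and words that are not too rare.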
Non-maximal model In the non-maximal models ($m < \ell - 2$), it is
straightforward to extend the previous martingale approach to prove the asymptotic
normality of $(N(w) - \hat N_m(w))/\sqrt{n}$ and to derive the asymptotic variance.
Indeed, for each value of $\ell - m$, the difference $N(w) - \hat N_m(w)$ can be decomposed as
a linear combination of martingales, exactly as for $T_n$. For instance, if $w = abcde$
and $m = 1$, write

$$
\begin{aligned}
N(abcde) - \hat N_1(abcde) &= N(abcde) - \frac{N(ab)\,N(bc)\,N(cd)\,N(de)}{N(b)\,N(c)\,N(d)}\\
&= \Big( N(abcde) - N(abcd)\,\frac{N(de)}{N(d)} \Big)\\
&\quad + \frac{N(de)}{N(d)} \Big( N(abcd) - N(abc)\,\frac{N(cd)}{N(c)} \Big)\\
&\quad + \frac{N(de)\,N(cd)}{N(d)\,N(c)} \Big( N(abc) - N(ab)\,\frac{N(bc)}{N(b)} \Big).
\end{aligned}
$$
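Another approach uses the $\delta$-method. The idea is to consider $N(w) - \hat N_m(w)$
as $f(\mathbf{N})$, where $\mathbf{N}$ is the count vector

$$
\mathbf{N} = \big( N(w),\ N(w_1 \cdots w_{m+1}), \dots, N(w_{\ell-m} \cdots w_\ell),\
N(w_2 \cdots w_{m+1}), \dots, N(w_{\ell-m} \cdots w_{\ell-1}) \big)
$$

(see Eq. (6.4.2)). There exists a covariance matrix $\Sigma$ such that

$$\frac{1}{\sqrt{n}} (\mathbf{N} - E(\mathbf{N})) \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \Sigma).$$

The next step is to use the $\delta$-method (Theorem 6.8.5) to transfer this
convergence to $f(\mathbf{N})$:

$$\frac{1}{\sqrt{n}} \big( f(\mathbf{N}) - f(E(\mathbf{N})) \big) \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \nabla \Sigma \nabla^t),$$

where $\nabla = \big( \frac{\partial f}{\partial x_j}(E(\mathbf{N})) \big)_{j=1,\dots,2(\ell-m)}$ is the partial derivative vector
of $f$. Since $f(E(\mathbf{N})) = 0$, we finally obtain

$$\frac{1}{\sqrt{n}} \big( N(w) - \hat N_m(w) \big) \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \nabla \Sigma \nabla^t).$$

However, this method does not easily provide an explicit formula for the
asymptotic variance since the function $f$ and its derivative depend on $\ell - m$.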

An alternative method is given by the conditional approach. The principle
is to work conditionally on the sufficient statistic $S_m$ of the model Mm,
namely the collection of counts $\{N(a_1 \cdots a_{m+1}),\ a_1, \dots, a_{m+1} \in A\}$ and the
first $m$ letters of the sequence. One can derive both the conditional expectation
$E(N(w) \mid S_m)$ and the conditional variance of $N(w)$. The key arguments are
first that the conditional expectation is asymptotically equivalent to $\hat N_m(w)$,
leading to the asymptotic normality of $(N(w) - E(N(w) \mid S_m))/\sqrt{n}$, and second,
that $n^{-1}\mathrm{Var}(N(w) \mid S_m)$ has the limiting value $\sigma_m^2(w)$ with

$$
\begin{aligned}
\sigma_m^2(w) &= \mu(w) + 2 \sum_{\substack{p \in P(w)\\ p \le \ell-m-1}} \mu(w^{(p)}w)
+ \mu(w)^2 \Bigg( \sum_{a_1, \dots, a_m} \frac{n(a_1 \cdots a_m \bullet)^2}{\mu(a_1 \cdots a_m)}\\
&\qquad - \sum_{a_1, \dots, a_{m+1}} \frac{n(a_1 \cdots a_{m+1})^2}{\mu(a_1 \cdots a_{m+1})}
+ \frac{1 - 2\,n(w_1 \cdots w_m \bullet)}{\mu(w_1 \cdots w_m)} \Bigg),
\end{aligned}
\tag{6.4.7}
$$

where $n(\cdot)$ denotes the number of occurrences inside $w$, and $n(a_1 \cdots a_m \bullet)$ stands
for $\sum_{b \in A} n(a_1 \cdots a_m b)$. Since the conditional moment of order 4 of $N(w)/\sqrt{n}$
is bounded, it follows that

$$\frac{1}{\sqrt{n}} \big( N(w) - \hat N_m(w) \big) \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \sigma_m^2(w)).$$

The overlapping structure of $w$ clearly appears in the limiting variance. It is
an exercise to verify that the limiting variances given by Theorem 6.4.2 and
Equation (6.4.7) with $m = \ell - 2$ are identical.


Taking the phase into account Both the martingale approach and the
conditional approach can be extended to the Mm-3 model (see Section 6.1 for
definition and notation). When one wants to distinguish the occurrences of
$w$ in a coding DNA sequence according to a particular phase $k \in \{1, 2, 3\}$ ($k$
represents the position of the word with respect to the codons), one is interested
in the count $N(w, k)$ of $w$ in phase $k$ in $X_1 \cdots X_n$; recall that the word phase is
the phase of its last letter. Here we state the result in the maximal model.


THEOREM 6.4.3. Assume $X = (X_i)_{i \in \mathbb{Z}}$ is a stationary $(\ell - 2)$-order Markov
chain on $A$ with transition probabilities $\pi_k(a_1 \cdots a_{\ell-2}, b)$ and stationary
distribution $\mu(a_1 \cdots a_{\ell-2}, k)$, $a_1, \dots, a_{\ell-2}, b \in A$, $k \in \{1, 2, 3\}$. As $n \to \infty$, we
have

$$
\frac{1}{\sqrt{n}} \Big( N(w, k) - \frac{N(w^-, k-1)\,N({}^-w, k)}{N({}^-w^-, k-1)} \Big)
\xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \sigma^2_{\ell-2}(w, k))
$$

with

$$
\sigma^2_{\ell-2}(w, k) = \frac{\mu(w, k)}{\mu({}^-w^-, k-1)^2}
\big( \mu({}^-w^-, k-1) - \mu({}^-w, k) \big) \big( \mu({}^-w^-, k-1) - \mu(w^-, k-1) \big)
$$

and

$$
\begin{aligned}
\mu(w^-, k-1) &= \mu(w_1 \cdots w_{\ell-2}, k-2)\, \pi_{k-1}(w_1 \cdots w_{\ell-2}, w_{\ell-1})\\
\mu(w, k) &= \mu(w^-, k-1)\, \pi_k({}^-w^-, w_\ell)\\
\mu({}^-w, k) &= \mu({}^-w^-, k-1)\, \pi_k({}^-w^-, w_\ell).
\end{aligned}
$$



278 Statistics on Words with Applications to Biological Sequences 


Error bound for the approximation Using Stein's method for normal
approximations, namely Theorem 6.8.1, provides a bound on the distance to the
normal distribution; however, it does not take the estimation of parameters into
account.

Recall $v^2 = \mathrm{Var}(N(w))$ from (6.4.1), and $\alpha$ given in (6.1.1). One has the
following result.


THEOREM 6.4.4. Assume $X = (X_i)_{i \in \mathbb{Z}}$ is a stationary first-order Markov chain.
Let $w$ be a word of length $\ell$ and $Z \sim \mathcal{N}((n - \ell + 1)\mu(w), v^2)$. There are constants
$c$ and $C_1$, $C_2$, $C_3$ such that


< _ < < i 
[P(N(w) <2) ~P(Z<a)| <e min, B., 


where 


B, = 2(4s — 3)v_' + 2n(2s — 1)(4s — 3)v~3(| log v~*| + logn) 
+Cynv*u(w)|ale— 1 
+C2(|loguv~1| + logn)(2s — 1)|als-**1 


+C3(|logv—1| + logn)(n — 28 + 1)np?(w)v-7|a|o-**7. 


The multivariate generalization will be presented in Theorem 6.6.1, where
the explicit forms of the constants $C_1$, $C_2$, and $C_3$ will be given.


6.4.4. Asymptotic distribution: the Poisson regime 


In the previous section, we showed that the count $N(w)$ of a word $w$ in a
random sequence of length $n$ can be approximated by a Gaussian distribution
for large $n$. This Gaussian approximation is in fact not good when the expected
count $(n - \ell + 1)\mu(w)$ is very small, meaning that $w$ is a rare word. Poisson
approximations are appropriate for counts of rare events. As an illustration,
it is well-known that a sum of independent Bernoulli variables can be either
approximated by a Gaussian distribution or a Poisson distribution, depending
on the asymptotic behavior of the expected value.

When the sequence letters are independent, Poisson and compound Poisson 
approximations for N(w) have been widely studied in the literature. As we 
will see, a Poisson distribution is not satisfactory for periodic words because of 
possible overlaps; a compound Poisson distribution is proposed. Two classes 
of tools can be used: generating functions, which do not provide any approx- 
imation error, and the Chen-Stein method, which gives a bound for the total 
variation distance between the two distributions (see Section 6.8.2 for details). 
In this section, we chose to present the Chen-Stein approach under a first-order 
Markovian model with known parameters; generalizations to higher order and 
to estimated parameters are presented at the end of the section. No assumption 
is made on the overlapping structure of the word w. 

We assume that $X = (X_i)_{i \in \mathbb{Z}}$ is a stationary first-order Markov chain
on $A$, with transition probabilities $\pi(a, b)$ and stationary distribution $\mu(a)$,
$a, b \in A$. Let $w = w_1 \cdots w_\ell$ be a word of length $\ell$ on $A$. Here, $Y_i = Y_i(w) =
\mathbb{I}\{w \text{ starts at position } i \text{ in } X\}$ and $\mu(w) = E(Y_i(w))$. Moreover, we make the
rare word assumption $n\mu(w) = O(1)$. Note that $n\mu(w) = O(1)$ also means
$\ell = O(\log n)$.

Applying Theorem 6.8.2 to the Bernoulli variables $Y_i$, we obtain a bound
$b_1 + b_2 + b_3$ for the total variation distance between the distribution of $N(w)$
and the Poisson distribution with mean $(n - \ell + 1)\mu(w)$ that does not converge
to 0 under the rare word assumption. The problem comes from the $b_2$ term and
the possible overlaps of periodic words. Indeed, let $w$ be a periodic word; its
set of periods $P(w)$ is not empty. Take $B_i = \{i - 2\ell + 1, \dots, i + 2\ell - 1\}$ for the
neighborhood of $i \in I = \{1, \dots, n - \ell + 1\}$; then $b_1$ and $b_3$ tend to 0 as $n \to +\infty$.
We obtain

$$
b_2 = \sum_{i \in I} \sum_{j \in B_i \setminus \{i\}} E(Y_i Y_j)
= 2(n - \ell + 1) \sum_{p \in P(w)} \mu(w^{(p)}w) + O(n\ell\mu^2(w));
$$

this quantity can be of order $O(1)$ if $P(w)$ contains small periods $p$. The Poisson
approximation is however valid for the count of non-periodic words because
the set of periods is empty. For periodic words, the crucial argument is to
consider clumps, as by definition they cannot overlap. We first prove that
the declumped count $\tilde N(w)$ can be approximated by a Poisson distribution with
mean $(n - \ell + 1)\tilde\mu(w)$ (see Eq. (6.2.7)) by applying Theorem 6.8.2 to the Bernoulli
variables $\tilde Y_i(w)$ defined in (6.2.5). For simplicity, the variables $\tilde Y_i(w)$ are denoted
by $\tilde Y_i$. In the next section we prove a compound Poisson approximation for
$N(w)$.


Poisson approximation for the declumped count Our aim is to approx- 
imate the vector Y = (Y;(w))ier of Bernoulli variables by a vector Z = (Z;)ier 
with independent Poisson coordinates of mean E(Z;) = E(Y;(w)) = fi(w), where 
ji(-) is defined in (6.2.7). To apply Theorem 6.8.2, we choose the following neigh- 
borhood of i € I: 

B= {j eI: |j -—4| < 3@- 3}. 


The neighborhood is such that, for $j$ not in $B_i$, there are no letters $X_h$ common
to $\tilde Y_i$ and $\tilde Y_j$, and moreover, the $X_h$'s defining $\tilde Y_i$ and those defining $\tilde Y_j$ are
separated by at least $\ell$ positions. It is important to consider a lag converging
to infinity with $n$ since it leads to the exponential decay of the $b_3$ term given
by Theorem 6.8.2, as we will see below. Deriving a bound for the total variation
distance between $\tilde Y$ and $Z$ consists of bounding the quantities $b_1$, $b_2$ and $b_3$ given
in (6.8.1), (6.8.2) and (6.8.3). Bounding $b_1$ presents no difficulty:


\[
b_1 = \sum_{i \in I} \sum_{j \in B_i} E(\tilde Y_i) E(\tilde Y_j)
    \le (n-\ell+1)(6\ell-5)\, \tilde\mu^2(w) = O\!\left(\frac{\log n}{n}\right).
\]
Since clumps of $w$ do not overlap in the sequence, $\tilde Y_i \tilde Y_j = 0$ for $|j - i| < \ell$.
Therefore, we get

\[
b_2 := \sum_{i \in I} \sum_{j \in B_i \setminus \{i\}} E(\tilde Y_i \tilde Y_j)
     \le 2 \sum_{i \in I} \sum_{j=i+\ell}^{i+3\ell-3} E(\tilde Y_i \tilde Y_j)
\]


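using the symmetry of $B_i$. Now we have

\[
E(\tilde Y_i \tilde Y_j) \le E(\tilde Y_i Y_j)
  = \tilde\mu(w)\, \Pi^{\,j-i-\ell+1}(w_\ell, w_1)\, \frac{\mu(w)}{\mu(w_1)}
\]

and

\[
b_2 \le 2(n-\ell+1)\, \tilde\mu(w)\, \frac{\mu(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1)
    = O\!\left(\frac{\log n}{n}\right).
\]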
Bounding b3 is a little more involved but we give all the steps because the same 
technique is used for the compound Poisson approximation of the count and will 
not be described in detail there. By definition we have 


\[
b_3 := \sum_{i \in I} E\left| E\big( \tilde Y_i - E(\tilde Y_i) \,\big|\, \sigma(\tilde Y_j,\ j \notin B_i) \big) \right| .
\]


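Since $\sigma(\tilde Y_j,\ j \notin B_i) \subset \sigma(X_1, \ldots, X_{i-2\ell+1}, X_{i+2\ell-1}, \ldots, X_n)$, properties of
conditional expectation and the Markov property give

\[
\begin{aligned}
b_3 &\le \sum_{i \in I} E\left| E\big( \tilde Y_i - E(\tilde Y_i) \,\big|\, X_{i-2\ell+1},\ X_{i+2\ell-1} \big) \right| \\
    &= \sum_{i \in I} \sum_{x,y \in A} \left| E\big( \tilde Y_i - E(\tilde Y_i) \,\big|\, X_{i-2\ell+1} = x,\ X_{i+2\ell-1} = y \big) \right|
       \times P(X_{i-2\ell+1} = x,\ X_{i+2\ell-1} = y).
\end{aligned}
\]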


To evaluate the right-hand term, we introduce the set of possible words of length
$\ell - 1$ preceding a clump of $w$:

\[
G(w) = \{ g = g_1 \cdots g_{\ell-1} : \text{for all } p \in P(w),\ g_{\ell-p} \cdots g_{\ell-1} \ne w^{(p)} \}. \tag{6.4.8}
\]


Thus a clump of $w$ starts at position $i$ in $(X_i)_{i \in \mathbb{Z}}$ if and only if one of the words
$gw$, $g \in G(w)$, starts at position $i - \ell + 1$. Therefore, we can write

\[
\tilde Y_i(w) = \sum_{g \in G(w)} Y_{i-\ell+1}(gw). \tag{6.4.9}
\]
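This gives

\[
\begin{aligned}
b_3 &\le \sum_{i \in I} \sum_{x,y \in A} \sum_{g \in G(w)}
        \left| E\big( Y_{i-\ell+1}(gw) - E(Y_{i-\ell+1}(gw)) \,\big|\, X_{i-2\ell+1} = x,\ X_{i+2\ell-1} = y \big) \right| \\
    &\qquad\qquad \times P(X_{i-2\ell+1} = x,\ X_{i+2\ell-1} = y) \\
    &= \sum_{i \in I} \sum_{x,y \in A} \sum_{g \in G(w)}
        \left| P(X_{i-2\ell+1} = x,\ Y_{i-\ell+1}(gw) = 1,\ X_{i+2\ell-1} = y)
        - \mu(gw)\, P(X_{i-2\ell+1} = x,\ X_{i+2\ell-1} = y) \right| \\
    &= \sum_{i \in I} \sum_{x,y \in A} \sum_{g \in G(w)}
        \left| \mu(x)\, \Pi^{\ell}(x, g_1)\, \frac{\mu(gw)}{\mu(g_1)}\, \Pi^{\ell}(w_\ell, y)
        - \mu(gw)\, \mu(x)\, \Pi^{4\ell-2}(x, y) \right| .
\end{aligned}
\]

We now use the diagonalization (6.1.3) and (6.1.4), with $\alpha$ given in (6.1.1),
yielding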


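\[
\begin{aligned}
b_3 &\le (n-\ell+1)\, |\alpha|^{\ell} \sum_{g \in G(w)} \mu(gw) \sum_{x,y \in A} \mu(x)
      \left| \frac{1}{\mu(g_1)} \sum_{(t,t') \ne (1,1)} \frac{\alpha_t^{\ell} \alpha_{t'}^{\ell}}{\alpha^{\ell}}\, Q_t(x, g_1)\, Q_{t'}(w_\ell, y)
      - \sum_{t=2}^{|A|} \frac{\alpha_t^{4\ell-2}}{\alpha^{\ell}}\, Q_t(x, y) \right| \\
    &\le (n-\ell+1)\, |\alpha|^{\ell} \sum_{g \in G(w)} \mu(gw)\, \gamma(\ell, w_\ell),
\end{aligned}
\]

where

\[
\gamma(\ell, a) = \max_{b \in A} \sum_{x,y \in A} \mu(x)
   \left| \frac{1}{\mu(b)} \sum_{(t,t') \ne (1,1)} \frac{\alpha_t^{\ell} \alpha_{t'}^{\ell}}{\alpha^{\ell}}\, Q_t(x, b)\, Q_{t'}(a, y)
   - \sum_{t=2}^{|A|} \frac{\alpha_t^{4\ell-2}}{\alpha^{\ell}}\, Q_t(x, y) \right| .
\]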


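Note that $\gamma(\ell, w_\ell) = O(1)$. From (6.4.8) we have $\sum_{g \in G(w)} \mu(gw) = \tilde\mu(w)$ and

\[
b_3 \le (n-\ell+1)\, \tilde\mu(w)\, \gamma(\ell, w_\ell)\, |\alpha|^{\ell} = O(|\alpha|^{\ell}).
\]

We have proved the next theorem.

THEOREM 6.4.5. Let $Z = (Z_i)_{i \in I}$ be independent Poisson variables with
expectation $E(Z_i) = E(\tilde Y_i(w)) = \tilde\mu(w)$. We have

\[
d_{TV}\big( \mathcal{L}(\tilde Y), \mathcal{L}(Z) \big) \le (n-\ell+1)\, \tilde\mu(w)
  \left( (6\ell-5)\, \tilde\mu(w) + \gamma(\ell, w_\ell)\, |\alpha|^{\ell}
  + \frac{2\mu(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1) \right).
\]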
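The declumped count $\tilde N(w)$ can be approximated by $\tilde N_{\inf}(w) := \sum_{i \in I} \tilde Y_i(w)$
since

\[
\begin{aligned}
d_{TV}\big( \mathcal{L}(\tilde N(w)), \mathcal{L}(\tilde N_{\inf}(w)) \big) &\le P(\tilde N(w) \ne \tilde N_{\inf}(w)) \\
  &\le (\ell-1)\big( \mu(w) - \tilde\mu(w) \big).
\end{aligned} \tag{6.4.10}
\]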


Using the triangle inequality leads to the following corollary. 


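COROLLARY 6.4.6. Let $Z$ be a Poisson variable with expectation
$E(Z) = (n-\ell+1)\tilde\mu(w)$. We have

\[
\begin{aligned}
d_{TV}\big( \mathcal{L}(\tilde N(w)), \mathcal{L}(Z) \big) \le{}& (n-\ell+1)\, \tilde\mu(w)
  \left( (6\ell-5)\, \tilde\mu(w) + \gamma(\ell, w_\ell)\, |\alpha|^{\ell}
  + \frac{2\mu(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1) \right) \\
  &+ (\ell-1)\big( \mu(w) - \tilde\mu(w) \big).
\end{aligned}
\]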
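The set $G(w)$ of (6.4.8), used in the proof above, is easy to enumerate for short words. The following Python sketch is only an illustration: the function names (`periods`, `preceding_words`, `overlaps_from_left`) are hypothetical, words are plain strings, and the brute-force check simply restates the defining property that no occurrence of $w$ may start inside $g$ and overlap the final $w$.

```python
from itertools import product


def periods(w):
    """Periods of w: all p in 1..len(w)-1 with w[i] == w[i+p] wherever both exist."""
    n = len(w)
    return [p for p in range(1, n) if all(w[i] == w[i + p] for i in range(n - p))]


def preceding_words(w, alphabet):
    """Enumerate G(w): words g of length len(w)-1 such that, for every period p
    of w, the last p letters of g differ from the first p letters of w."""
    ell = len(w)
    ps = periods(w)
    result = []
    for letters in product(alphabet, repeat=ell - 1):
        g = "".join(letters)
        # g[ell-1-p:] is the suffix of g of length p (g has length ell-1)
        if all(g[ell - 1 - p:] != w[:p] for p in ps):
            result.append(g)
    return result


def overlaps_from_left(g, w):
    """Brute force: does the text g+w contain an occurrence of w starting inside g?"""
    text = g + w
    return any(text[s:s + len(w)] == w for s in range(len(g)))


if __name__ == "__main__":
    # Illustrative two-letter alphabet (an assumption, not from the text).
    print(preceding_words("aba", "ab"))
```

A clump of `w` starts right after `g` exactly when `overlaps_from_left(g, w)` is false, which is how decomposition (6.4.9) is used in the bound on $b_3$.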


Estimation of the parameters When the transition probabilities are unknown
and can only be estimated from the observed sequence, we need to evaluate
the total variation distance between the word count distribution and the
distribution of $\sum_{k \ge 1} k \hat Z_k$, the $\hat Z_k$'s being independent Poisson variables with
expectation $(n-\ell+1)\hat{\tilde\mu}_k(w)$, where $\hat{\tilde\mu}_k(w)$ is the observed value of the plug-in
maximum likelihood estimator of $\tilde\mu_k(w)$. Similarly, we want to know the total
variation distance between the declumped count, $\tilde N(w)$, and the Poisson variable
with expectation $(n-\ell+1)\hat{\tilde\mu}(w)$. For this we use the triangle inequality
and the fact that the total variation distance between two Poisson variables with
expectation $\lambda$ and $\lambda'$ is less than $|\lambda - \lambda'|$:

\[
\begin{aligned}
d_{TV}\big( \mathcal{L}(\tilde N(w)), \mathrm{Po}((n-\ell+1)\hat{\tilde\mu}(w)) \big)
  \le{}& d_{TV}\big( \mathcal{L}(\tilde N(w)), \mathrm{Po}((n-\ell+1)\tilde\mu(w)) \big) \\
  &+ (n-\ell+1)\, |\hat{\tilde\mu}(w) - \tilde\mu(w)|.
\end{aligned}
\]


Using the Law of Iterated Logarithm for Markov chains and Equation (6.2.4) 
one can show that 


y] 


fi(w) = u(w) (1 40 (Stews) almost surely (a.s.) 


Under the rare word condition nu(w) = O(1), we get 


nf) ~ npw) = 0 (SREPE™) as 


Now, using Equation (6.2.7), we obtain 


nw) — nfiw) = 0 (ERE) 
This quantity converges to zero as $n \to \infty$, because the rare word condition
implies that $\ell = O(\log n)$. Thus,


dry (£(N(w)), Po((n — 6+ Ya(w))) 


< diy (£(V(w)), Po((n =e L)Fi(w)) ) £0 — 


Vn 


The approximation follows from Corollary 6.4.8. 

We do not have an explicit bound for this additional error term. However, 
for long sequences the error term due to the maximum-likelihood estimation 
will be small compared to the bound on the Poisson approximation error. 


6.4.5. Asymptotic distribution: the Compound Poisson regime 


Here we present two approaches for a compound Poisson approximation for the
count. Firstly, such an approximation can be derived using a Poisson process
approximation for the Bernoulli variables $\tilde Y_{i,k}(w)$ defined in (6.2.9) and by using
that $N(w)$ is asymptotically equivalent to $\sum_{i \in I} \sum_{k \ge 1} k \tilde Y_{i,k}(w)$ in probability.
For simplicity, the variables $\tilde Y_{i,k}(w)$ are denoted by $\tilde Y_{i,k}$. Secondly, a direct
approximation for $N(w)$ can be obtained using Stein's method for compound
Poisson approximation. The second method yields better bounds on the
approximation, whereas the first method is easier to generalize to multivariate results,
as will be shown in Section 6.6.


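Compound Poisson approximation via Poisson process To approximate
the distribution of the count $N(w)$, we first use that $N(w)$ is asymptotically
equivalent to $N_{\inf}(w) := \sum_{i=1}^{n-\ell+1} \sum_{k \ge 1} k \tilde Y_{i,k}$ in probability:

\[
\begin{aligned}
d_{TV}\big( \mathcal{L}(N(w)), \mathcal{L}(N_{\inf}(w)) \big) &\le P(N(w) \ne N_{\inf}(w)) \\
  &\le 2(\ell-1)\big( \mu(w) - \tilde\mu(w) \big).
\end{aligned} \tag{6.4.11}
\]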


Our goal is now to approximate the vector $(\tilde Y_{i,k})_{(i,k) \in I}$, $I = \{1, \ldots, n-\ell+1\} \times \{1, 2, \ldots\}$,
of Bernoulli variables by a vector $(Z_{i,k})_{(i,k) \in I}$ with independent
Poisson coordinates of expectation $E(Z_{i,k}) = E(\tilde Y_{i,k}) = \tilde\mu_k(w)$, where $\tilde\mu_k(\cdot)$
is given in Equation (6.2.10). The neighborhood $B_{i,k}$ of $(i,k)$ is such that,
for $(j,k')$ not in $B_{i,k}$, the letters $X_h$ defining $\tilde Y_{i,k}$ and those defining $\tilde Y_{j,k'}$
are separated by at least $\ell$ positions. Since $\tilde Y_{i,k}$ can be described by at most
$X_{i-\ell+1}, \ldots, X_{i+(k+1)(\ell-1)}$, we consider

\[
B_{i,k} = \{ (j,k') \in I : -(k'+3)(\ell-1) \le j - i \le (k+3)(\ell-1) \}.
\]


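We bound successively the quantities given in (6.8.1), (6.8.2) and (6.8.3). By
definition

\[
\begin{aligned}
b_1 &:= \sum_{(i,k) \in I} \sum_{(j,k') \in B_{i,k}} E(\tilde Y_{i,k})\, E(\tilde Y_{j,k'}) \\
    &\le \sum_{i=1}^{n-\ell+1} \sum_{k \ge 1} \sum_{k' \ge 1} \sum_{j=i-(k'+3)(\ell-1)}^{i+(k+3)(\ell-1)} \tilde\mu_k(w)\, \tilde\mu_{k'}(w) \\
    &\le (n-\ell+1) \sum_{k \ge 1} \sum_{k' \ge 1} \big( (k+k'+6)(\ell-1) + 1 \big)\, \tilde\mu_k(w)\, \tilde\mu_{k'}(w).
\end{aligned}
\]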


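From (6.2.7) and (6.2.10), we use that

\[
\sum_{k \ge 1} \tilde\mu_k(w) = \tilde\mu(w), \tag{6.4.12}
\]
\[
\sum_{k \ge 1} k\, \tilde\mu_k(w) = \mu(w), \tag{6.4.13}
\]

to obtain

\[
b_1 \le (n-\ell+1) \big( 2(\ell-1)\, \tilde\mu(w)\, \mu(w) + (6\ell-5)\, \tilde\mu(w)^2 \big).
\]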


The $b_2$ term involves products such as $\tilde Y_{i,k} \tilde Y_{j,k'}$ with $(j,k') \in B_{i,k}$. Since
a $k$-clump of $w$ at position $i$ cannot overlap a $k'$-clump of $w$, many of these
products are zero. To identify them, we need to describe in more detail the
compound words $c \in C_k(w)$ and $c' \in C_{k'}(w)$ that may occur at positions $i$ and $j$.
For this purpose, we introduce the set of words of length $\ell - 1$ that can follow
a clump of $w$:

\[
D(w) = \{ d = d_1 \cdots d_{\ell-1} : \forall p \in P(w),\ d_1 \cdots d_p \ne w_{\ell-p+1} \cdots w_\ell \}.
\]

Therefore, we can write

\[
\tilde Y_{i,k}(w) = \sum_{g \in G(w),\ c \in C_k(w),\ d \in D(w)} Y_{i-\ell+1}(gcd). \tag{6.4.14}
\]


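For convenience, we write $\sum_{gcd}$ for the sum over $g \in G(w)$, $c \in C_k(w)$, $d \in D(w)$,
and, similarly, $\sum_{g'c'd'}$ for the sum over $g' \in G(w)$, $c' \in C_{k'}(w)$ and $d' \in D(w)$.
This gives

\[
\begin{aligned}
b_2 &:= \sum_{(i,k) \in I} \sum_{(j,k') \in B_{i,k} \setminus \{(i,k)\}} E(\tilde Y_{i,k} \tilde Y_{j,k'}) \\
    &= \sum_{i=1}^{n-\ell+1} \sum_{k \ge 1} \sum_{k' \ge 1} \sum_{gcd} \sum_{g'c'd'} \sum_{j=i-(k'+3)(\ell-1)}^{i+(k+3)(\ell-1)}
       E\big( Y_{i-\ell+1}(gcd)\, Y_{j-\ell+1}(g'c'd') \big).
\end{aligned}
\]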


For $i - |c'| < j < i + |c|$, we have that $Y_{i-\ell+1}(gcd)\, Y_{j-\ell+1}(g'c'd') = 0$ because
clumps do not overlap. We distinguish two cases:

(1) $g'c'd'$ at position $j - \ell + 1$ overlaps $gcd$ at position $i - \ell + 1$ (this is only
possible over at most $2(\ell-1)$ letters); that is, for

\[
j \in \{ i - |c'| - 2\ell + 3, \ldots, i - |c'| \} \cup \{ i + |c|, \ldots, i + |c| + 2\ell - 3 \};
\]

let $b_{21}$ denote the associated term.

(2) $g'c'd'$ at position $j - \ell + 1$ does not overlap $gcd$ at position $i - \ell + 1$; that
is, for

\[
j \in \{ i - (k'+3)(\ell-1), \ldots, i - |c'| - 2\ell + 2 \} \cup \{ i + |c| + 2\ell - 2, \ldots, i + (k+3)(\ell-1) \};
\]

let $b_{22}$ denote the associated term.
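By symmetry, we have

\[
b_{21} \le 2 \sum_{i=1}^{n-\ell+1} \sum_{k \ge 1} \sum_{k' \ge 1} \sum_{gcd} \sum_{g'c'd'} \sum_{j=i+|c|}^{i+|c|+2\ell-3}
   E\big( Y_{i-\ell+1}(gcd)\, Y_{j-\ell+1}(g'c'd') \big).
\]

Summing over $k'$, $g'$, $c'$ and $d'$ gives

\[
b_{21} \le 2 \sum_{i=1}^{n-\ell+1} \sum_{k \ge 1} \sum_{gcd} \sum_{j=i+|c|}^{i+|c|+2\ell-3}
   E\big( Y_{i-\ell+1}(gcd)\, \tilde Y_j(w) \big);
\]

now, summing over $d$ and using that $\tilde Y_j(w) \le Y_j(w)$ leads to

\[
b_{21} \le 2 \sum_{i=1}^{n-\ell+1} \sum_{k \ge 1} \sum_{gc} \sum_{j=i+|c|}^{i+|c|+2\ell-3}
   E\big( Y_{i-\ell+1}(gc)\, Y_j(w) \big).
\]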


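An occurrence of $gc$ at position $i - \ell + 1$ does not overlap an occurrence of $w$ at
position $j \ge i + |c|$; thus it follows that

\[
E\big( Y_{i-\ell+1}(gc)\, Y_j(w) \big) = \mu(gc)\, \Pi^{\,j-i-|c|+1}(w_\ell, w_1)\, \frac{\mu(w)}{\mu(w_1)},
\]

and

\[
b_{21} \le 2(n-\ell+1)\, \frac{\mu(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1) \sum_{k \ge 1} \sum_{gc} \mu(gc).
\]

Finally, note that

\[
\sum_{k \ge 1} \sum_{gc} \mu(gc) = \sum_{k \ge 1} \sum_{k^* \ge k} \tilde\mu_{k^*}(w) = \sum_{k^* \ge 1} k^*\, \tilde\mu_{k^*}(w) = \mu(w),
\]

which leads to

\[
b_{21} \le 2(n-\ell+1)\, \frac{\mu^2(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1) = O\!\left(\frac{\log n}{n}\right).
\]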


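The $b_{22}$ term is easier to bound and we get

\[
b_{22} \le 2(n-\ell+1)\, \frac{\tilde\mu(w)}{\mu_{\min}} \big( (\ell-2)\, \mu(w) + \tilde\mu(w) \big) = O\!\left(\frac{\log n}{n}\right),
\]

where $\mu_{\min}$ is the smallest value of $\{\mu(a),\ a \in A\}$.
Combining these bounds, we have

\[
b_2 \le 2(n-\ell+1)\, \frac{\mu^2(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1)
   + 2(n-\ell+1)\, \frac{\tilde\mu(w)}{\mu_{\min}} \big( (\ell-2)\, \mu(w) + \tilde\mu(w) \big).
\]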


Bounding $b_3$ consists of following the different steps previously detailed for
the declumped count and using the decomposition (6.4.14) instead of (6.4.9).
Since there is no interest in repeating this technical part, we just give the bound
on $b_3$ and state the theorem:

\[
b_3 \le (n-\ell+1)\, \tilde\mu(w)\, \gamma_2(\ell)\, |\alpha|^{\ell}
\]


with 
1 os 
y2() = So ala) re \ Ok dy |g Ge, 8) Qu (a, i) 
x,yeA (t,t ACT) 
IAl | 5¢-3 
+5°|+-Qi(z,y) 
t=2 


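THEOREM 6.4.7. Let $(Z_{i,k})_{(i,k) \in I}$ be independent Poisson variables with
expectation $E(Z_{i,k}) = E(\tilde Y_{i,k}(w)) = \tilde\mu_k(w)$. With the previous notation, we have

\[
\begin{aligned}
d_{TV}\big( \mathcal{L}((\tilde Y_{i,k}(w))_{(i,k) \in I}),\ \mathcal{L}((Z_{i,k})_{(i,k) \in I}) \big)
 \le{}& (n-\ell+1)\, \tilde\mu(w) \big( 2(\ell-1)\, \mu(w) + (6\ell-5)\, \tilde\mu(w) + \gamma_2(\ell)\, |\alpha|^{\ell} \big) \\
 &+ 2(n-\ell+1) \left( \frac{\mu^2(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1)
    + \frac{\tilde\mu(w)}{\mu_{\min}} \big( (\ell-2)\, \mu(w) + \tilde\mu(w) \big) \right).
\end{aligned}
\]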


From the total variation distance properties, we have

\[
d_{TV}\Big( \mathcal{L}\Big( \sum_{(i,k) \in I} k\, \tilde Y_{i,k} \Big),\ \mathcal{L}\Big( \sum_{(i,k) \in I} k\, Z_{i,k} \Big) \Big)
 \le d_{TV}\big( \mathcal{L}((\tilde Y_{i,k}(w))_{(i,k) \in I}),\ \mathcal{L}((Z_{i,k})_{(i,k) \in I}) \big).
\]

Since the $Z_{i,k}$'s are independent Poisson variables, $\sum_{(i,k) \in I} k Z_{i,k}$ has the same
distribution as $\sum_{k \ge 1} k Z_k$, where the $Z_k$'s are independent Poisson variables
with expectation $(n-\ell+1)\tilde\mu_k(w)$. Note that the latter has a compound Poisson
distribution with parameters $\big( (n-\ell+1)\tilde\mu(w),\ (\tilde\mu_k(w)/\tilde\mu(w))_k \big)$. Because of the
expressions of $\tilde\mu(w)$ and $\tilde\mu_k(w)$ given by (6.2.7) and (6.2.10), this compound
Poisson distribution reduces to a Pólya-Aeppli distribution. Using the triangle
inequality leads to the following corollary.


Version June 23, 2004 




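COROLLARY 6.4.8. Let $(Z_k)_{k \ge 1}$ be independent Poisson variables with
expectation $E(Z_k) = (n-\ell+1)\tilde\mu_k(w)$. Let

\[
CP = CP\Big( (n-\ell+1)\, \mu(w)\, (1 - A(w)),\ \big( (1 - A(w))\, A^{k-1}(w) \big)_{k \ge 1} \Big) \tag{6.4.15}
\]

denote the compound Poisson distribution of $\sum_{k \ge 1} k Z_k$. With the previous
notation, we have

\[
\begin{aligned}
d_{TV}\big( \mathcal{L}(N(w)), CP \big) \le{}&
 (n-\ell+1)\, \tilde\mu(w) \big( 2(\ell-1)\, \mu(w) + (6\ell-5)\, \tilde\mu(w) + \gamma_2(\ell)\, |\alpha|^{\ell} \big) \\
 &+ 2(n-\ell+1) \left( \frac{\mu^2(w)}{\mu(w_1)} \sum_{s=1}^{2\ell-2} \Pi^{s}(w_\ell, w_1)
    + \frac{\tilde\mu(w)}{\mu_{\min}} \big( (\ell-2)\, \mu(w) + \tilde\mu(w) \big) \right) \\
 &+ 2(\ell-1)\big( \mu(w) - \tilde\mu(w) \big).
\end{aligned}
\]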


Such a bound on the total variation distance between, for instance, the word
count distribution and the associated compound Poisson distribution has the
great advantage of providing confidence intervals (see Section 6.8.2). Indeed,
using notation from Corollary 6.4.8, for all $t \in \mathbb{R}$, we have

\[
\Big| P(N(w) \ge t) - P\Big( \sum_{k \ge 1} k Z_k \ge t \Big) \Big|
  \le d_{TV}\Big( \mathcal{L}(N(w)),\ \mathcal{L}\Big( \sum_{k \ge 1} k Z_k \Big) \Big).
\]
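To make the tail comparison above concrete, the distribution CP of (6.4.15) can be tabulated numerically. The sketch below is an illustration under stated assumptions, not a method taken from the text: it uses the standard Panjer-type recursion for compound Poisson sums, with `lam` standing for $(n-\ell+1)\mu(w)(1-A(w))$ and `a` for $A(w)$, and it truncates the support at a user-chosen `x_max`. The function names are hypothetical.

```python
import math


def compound_poisson_pmf(lam, a, x_max):
    """Probabilities P(S = x), x = 0..x_max, for S = sum_k k*Z_k where the Z_k are
    independent Poisson with mean lam*(1-a)*a**(k-1), i.e. a Poisson(lam) number
    of clusters with geometric sizes q_k = (1-a)*a**(k-1), k >= 1.
    Recursion: p(0) = exp(-lam), p(x) = (lam/x) * sum_{k=1..x} k*q_k*p(x-k)."""
    q = [0.0] + [(1.0 - a) * a ** (k - 1) for k in range(1, x_max + 1)]
    p = [math.exp(-lam)] + [0.0] * x_max
    for x in range(1, x_max + 1):
        p[x] = lam / x * sum(k * q[k] * p[x - k] for k in range(1, x + 1))
    return p


def upper_tail(pmf, t):
    """P(S >= t) for an integer t between 0 and len(pmf), ignoring the mass
    beyond the truncation point of the table."""
    return 1.0 - sum(pmf[:t])


if __name__ == "__main__":
    # Illustrative values only: lam and a would come from (6.4.15).
    table = compound_poisson_pmf(lam=2.0, a=0.3, x_max=200)
    print(upper_tail(table, 8))
```

With `a = 0` every cluster has size one and the table reduces to an ordinary Poisson distribution, which matches the non-periodic case.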


Direct compound Poisson approximation Empirically, a compound Poisson
approximation often also gives good results when the underlying words are
not so rare, indicating that the theoretical bounds are not sharp. Using the
direct compound Poisson approximation Theorem 6.8.4, it is possible to obtain
improved bounds for $N(w)$. For this, choose as neighborhoods in Theorem 6.8.4

\[
B(i,k) = \{ (j,k') : -(k'+2)(\ell-1) - r + 1 \le j - i \le (k+2)(\ell-1) + r - 1 \},
\]

where r > 1 can be chosen. In Theorem 6.4.5 we had $r = \ell$. Recall (6.2.10), p
from (6.1.5), $\Gamma$ from (6.5.5), and CP from (6.4.15). One obtains the following
result.


THEOREM 6.4.9. If A(w) < 4, then 


dry (L(N(w)), CP) < eae (Ar t+Jf(n—e4+ Tw) Ao ) + 2(¢—1)u(w), 
where


Ao = 29" (2 a 3pse-Dtr 4 2p") 


Ay = 2p:(w) {3e- 1) +r+2(€- SS 


2A(w)l—1l—=po) , 2-1) +r=-1 
on ( (@—-Awy + @—AtwyP )} 


and po is the shortest period of w. The value r can be chosen to minimize the 
estimates. 


Estimation of the parameters When estimating the parameters, as in
Subsection 6.4.4, the total variation distance between the two compound Poisson
distributions is bounded by

\[
d_{TV}\Big( \mathcal{L}\Big( \sum_{k \ge 1} k Z_k \Big),\ \mathcal{L}\Big( \sum_{k \ge 1} k \hat Z_k \Big) \Big)
  \le \sum_{k \ge 1} \big| n \hat{\tilde\mu}_k(w) - n \tilde\mu_k(w) \big|.
\]

Using Equation (6.2.10), this quantity tends to zero as $n \to \infty$ when $n\mu(w) = O(1)$.

Again, for long sequences the error term due to the maximum-likelihood 
estimation will be small compared to the bound on the compound Poisson ap- 
proximation error. 


Generalization to Mm Let us now assume that the sequence $(X_i)_{i \in \mathbb{Z}}$ is an
$m$-order Markov chain on the alphabet $A$, with transition probabilities
$\pi(a_1 \cdots a_m, a_{m+1})$, $a_1, \ldots, a_{m+1} \in A$. The basic idea is to rewrite the sequence over the
alphabet $A^m$ using the embedding (6.1.6),

\[
\mathbb{X}_i = X_i X_{i+1} \cdots X_{i+m-1},
\]

so that the sequence $(\mathbb{X}_i)_{i \in \mathbb{Z}}$ is a first-order Markov chain on $A^m$ with transition
probabilities ($A = a_1 \cdots a_m \in A^m$, $B = b_1 \cdots b_m \in A^m$)

\[
\mathbf{\Pi}(A, B) =
\begin{cases}
\pi(a_1 \cdots a_m, b_m) & \text{if } a_2 \cdots a_m = b_1 \cdots b_{m-1}, \\
0 & \text{otherwise.}
\end{cases}
\]
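The embedding can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the $m$-order kernel is stored as a dictionary keyed by `(context, letter)`, states of the embedded chain are strings of length `m`, and the function name `embed_transition` is hypothetical.

```python
from itertools import product


def embed_transition(pi, alphabet, m):
    """Build the first-order transition probabilities on A^m from an m-order
    kernel pi[(context, letter)], where context is a string of length m.
    Returns the list of states and a dict big_pi[(a, b)]."""
    states = ["".join(t) for t in product(alphabet, repeat=m)]
    big_pi = {}
    for a in states:
        for b in states:
            # b must extend a by one letter: a_2...a_m == b_1...b_{m-1}
            if a[1:] == b[:-1]:
                big_pi[(a, b)] = pi[(a, b[-1])]
            else:
                big_pi[(a, b)] = 0.0
    return states, big_pi


if __name__ == "__main__":
    # Illustrative second-order kernel on {a, b}; the numbers are made up.
    pi = {("aa", "a"): 0.1, ("aa", "b"): 0.9, ("ab", "a"): 0.4, ("ab", "b"): 0.6,
          ("ba", "a"): 0.7, ("ba", "b"): 0.3, ("bb", "a"): 0.5, ("bb", "b"): 0.5}
    states, big_pi = embed_transition(pi, "ab", 2)
    print(big_pi[("ab", "ba")])
```

Each row of the embedded matrix has at most $|A|$ nonzero entries, so for larger $m$ a sparse representation would be the more natural choice.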


Denote by $W = W_1 \cdots W_{\ell-m+1}$ the word $w = w_1 \ldots w_\ell$ written using the alphabet
$A^m$, so that $W_j = w_j \ldots w_{j+m-1}$. The results presented below are valid for
the number $N(W)$ of overlapping occurrences and the number $\tilde N(W)$ of clumps
of $W$ in $\mathbb{X}_1 \cdots \mathbb{X}_{n-m+1}$. Since an occurrence of $w$ at position $i$ in $X_1 \cdots X_n$
corresponds to an occurrence of $W$ at position $i - m + 1$ in $\mathbb{X}_1 \cdots \mathbb{X}_{n-m+1}$,
we simply have $N(w) = N(W)$. In contrast, clumps of $W$ in $\mathbb{X}_1 \cdots \mathbb{X}_{n-m+1}$
are different from clumps of $w$ in $X_1 \cdots X_n$ because $W$ is less periodic than
$w$, leading to $\tilde N(W) \ne \tilde N(w)$. Let us take a simple example: $w$ = ata and
$m = 2$. Put A = at $\in A^2$ and B = ta $\in A^2$; we then have $W$ = AB. The
sequence tatatatat contains a unique clump of ata whereas the associated
sequence BABABABA contains 3 clumps of AB. Indeed, AB has no period and
ata has one period. In fact, the periods of $W$ are those periods of $w$ that are
strictly less than $\ell - m + 1$. Therefore, the Poisson approximation for the
declumped count in an $m$-order Markov chain does not follow immediately from the
case $m = 1$; a rigorous proof would require applying the Chen-Stein theorem
with an adapted neighborhood and bounding the new quantities $b_1$, $b_2$ and $b_3$
in Mm, but this has not yet been carried out.
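The example with $w$ = ata and $m = 2$ can be checked mechanically. The following Python sketch is illustrative: the helper names (`occurrences`, `count_clumps`, `embed`) are hypothetical, positions are 0-based, and a clump is taken to be a maximal run of occurrences in which consecutive start positions differ by less than the word length.

```python
def occurrences(w, seq):
    """0-based start positions of all (possibly overlapping) occurrences of w in seq.
    Works for strings, and for lists of symbols when w is also a list."""
    n = len(w)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == w]


def count_clumps(w, seq):
    """Number of clumps of w in seq: a new clump starts whenever an occurrence
    does not overlap the previous occurrence."""
    clumps, last = 0, None
    for i in occurrences(w, seq):
        if last is None or i - last >= len(w):
            clumps += 1
        last = i
    return clumps


def embed(seq, m):
    """Rewrite seq over the alphabet A^m: the list of all windows of length m."""
    return [seq[i:i + m] for i in range(len(seq) - m + 1)]


if __name__ == "__main__":
    text = "tatatatat"
    print(count_clumps("ata", text))                    # clumps of w in X
    print(count_clumps(embed("ata", 2), embed(text, 2)))  # clumps of W = AB
```

The number of occurrences is the same in both representations; only the grouping into clumps changes, because consecutive occurrences of AB no longer overlap.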

Since $N(w) = N(W)$, Corollary 6.4.8 ensures that $N(w)$ can be approximated
by a sum $\sum_{k \ge 1} k Z_k$, where $Z_k$ is a Poisson variable whose expectation is
$(n-\ell+1)$ times the probability that a $k$-clump of $W$ starts at a given position
in $\mathbb{X}_1 \cdots \mathbb{X}_{n-m+1}$. From Equation (6.2.10), we obtain

\[
E(Z_k) = (n-\ell+1)\, (1 - A'(w))^2\, A'(w)^{k-1}\, \mu(w)
\]

with

\[
A'(w) = \frac{1}{\mu(w)} \sum_{p \in P'(w) \cap \{1, \ldots, \ell-m\}} \mu(w^{(p)} w).
\]

An important consequence is that, in Mm, the compound Poisson approximation
for words that cannot overlap on more than $m - 1$ letters becomes a
single Poisson approximation.


6.4.6. Large deviation approximations 


For long sequences, the probability that a given word occurs more than a
certain number of times can be approximated using a Gaussian or a compound
Poisson distribution (Sections 6.4.3 and 6.4.5). The aim of this section is to
show that large deviation techniques can also be used to approximate the
probability that a given word frequency deviates from its expected value by more
than a certain amount. Let $w = w_1 \cdots w_\ell$ be a word of length $\ell$; recall that
$\mu(w)$ denotes the probability that $w$ occurs at a given position in $X_1 \cdots X_n$.
We aim to provide good approximations for $P\big( \tfrac{1}{n-\ell+1} N(w) \ge \mu(w) + b \big)$ and
$P\big( \tfrac{1}{n-\ell+1} N(w) \le \mu(w) - b \big)$ with $0 < b < 1$.

We assume that $X_1 \cdots X_n$ is a stationary first-order Markov chain on a finite
alphabet $A$ with transition probabilities $\pi(a,b) > 0$, $a, b \in A$. (Generalization to
Mm follows the same setup as in Section 6.4.5, using (6.1.6).) To use Theorem
6.8.6 for $\tfrac{1}{n-\ell+1} N(w)$, we need to consider the irreducible Markov chain $\mathbb{X}_1$,
..., $\mathbb{X}_{n-\ell+1}$ on $A^\ell$ where $\mathbb{X}_i = X_i \cdots X_{i+\ell-1}$, with transition matrix
$\mathbf{\Pi} = (\mathbf{\Pi}(u,v))_{u,v \in A^\ell}$ such that

\[
\mathbf{\Pi}(u, v) =
\begin{cases}
\pi(u_\ell, v_\ell) & \text{if } u_{j+1} = v_j,\ j = 1, \ldots, \ell-1, \\
0 & \text{otherwise.}
\end{cases}
\]
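The count $N(w)$ can then be written as

\[
N(w) = \sum_{i=1}^{n-\ell+1} \mathbb{I}\{ X_i \cdots X_{i+\ell-1} = w_1 \cdots w_\ell \}.
\]

Let $I$ be the function

\[
I(x) = \sup_{\theta \in \mathbb{R}} \big( \theta x - \log \lambda(\theta) \big),
\]

$x \in \mathbb{R}$, where $\lambda(\theta)$ is the largest eigenvalue of the matrix $\mathbf{\Pi}_\theta = (\mathbf{\Pi}_\theta(u,v))_{u,v \in A^\ell}$
defined by

\[
\mathbf{\Pi}_\theta(u, v) =
\begin{cases}
e^{\theta}\, \mathbf{\Pi}(u, v) & \text{if } u = w, \\
\mathbf{\Pi}(u, v) & \text{otherwise.}
\end{cases}
\]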


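Let $0 < b < 1$; applying Theorem 6.8.6 with the function $f(u) = \mathbb{I}\{u = w\}$ to
the closed subset $[\mu(w) + b, +\infty)$ and the open subset $(\mu(w) + b, +\infty)$, we obtain

\[
\lim_{n \to +\infty} \frac{1}{n-\ell+1} \log P\Big( \frac{1}{n-\ell+1} N(w) \ge \mu(w) + b \Big) = -I(\mu(w) + b);
\]

indeed, the rate function $I$ is convex and minimal at $E(f(\mathbb{X}_i)) = \mu(w)$. Similarly
we have

\[
\lim_{n \to +\infty} \frac{1}{n-\ell+1} \log P\Big( \frac{1}{n-\ell+1} N(w) \le \mu(w) - b \Big) = -I(\mu(w) - b).
\]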
Denoting the observed count of $w$ in the biological sequence by $N^{obs}(w)$, as
a consequence we have for large $n$:

if $N^{obs}(w) > (n-\ell+1)\mu(w)$ and $b := \frac{N^{obs}(w)}{n-\ell+1} - \mu(w)$, then

\[
P\big( N(w) \ge N^{obs}(w) \big) \approx \exp\Big( -(n-\ell+1)\, I\Big( \frac{N^{obs}(w)}{n-\ell+1} \Big) \Big);
\]

if $N^{obs}(w) < (n-\ell+1)\mu(w)$ and $b := \mu(w) - \frac{N^{obs}(w)}{n-\ell+1}$, then

\[
P\big( N(w) \le N^{obs}(w) \big) \approx \exp\Big( -(n-\ell+1)\, I\Big( \frac{N^{obs}(w)}{n-\ell+1} \Big) \Big).
\]

Note that this approximation is obtained assuming the transition probabilities
$\pi(a,b)$, $a, b \in A$, are known. Moreover, since $\lambda(\theta)$ is an eigenvalue of a $|A|^\ell \times |A|^\ell$
matrix, the word length $\ell$ is a limiting factor for the numerical calculation, even
if $|A| = 4$.
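For very short words the rate function can be evaluated numerically. The Python sketch below is a rough illustration under several assumptions of my own: the transition probabilities are stored as a nested dictionary `pi[a][b]`, the largest eigenvalue is obtained by power iteration on a shifted matrix (a generic numerical technique, not one prescribed by the text), and the supremum over $\theta$ is searched by ternary search on a bounded interval, so values of $x$ whose optimal $\theta$ lies outside `[theta_lo, theta_hi]` are not handled. All function names are hypothetical.

```python
import math
from itertools import product


def embedded_matrix(pi, alphabet, ell):
    """Transition matrix of the chain of windows X_i...X_{i+ell-1} on A^ell."""
    states = ["".join(t) for t in product(alphabet, repeat=ell)]
    mat = [[pi[u[-1]][v[-1]] if u[1:] == v[:-1] else 0.0 for v in states]
           for u in states]
    return states, mat


def perron_root(mat, iters=5000, tol=1e-13):
    """Largest eigenvalue of a nonnegative irreducible matrix, by power iteration
    on mat + shift*I (same eigenvectors, eigenvalues moved by shift)."""
    n = len(mat)
    shift = max(sum(row) for row in mat)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        nxt = [sum(mat[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        total = sum(nxt)          # v sums to 1, so total estimates the eigenvalue
        v = [x / total for x in nxt]
        converged = abs(total - lam) < tol * total
        lam = total
        if converged:
            break
    return lam - shift


def rate_function(x, pi, alphabet, w, theta_lo=-8.0, theta_hi=8.0):
    """I(x) = sup_theta (theta*x - log lambda(theta)), searched on [theta_lo, theta_hi]."""
    states, base = embedded_matrix(pi, alphabet, len(w))
    k = states.index(w)

    def objective(theta):
        # Row of state w is multiplied by exp(theta), as in the definition above.
        tilted = [[(math.exp(theta) if i == k else 1.0) * base[i][j]
                   for j in range(len(states))] for i in range(len(states))]
        return theta * x - math.log(perron_root(tilted))

    lo, hi = theta_lo, theta_hi
    for _ in range(80):           # the objective is concave in theta
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2)


if __name__ == "__main__":
    pi = {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.3, "b": 0.7}}  # made-up chain
    mu_w = (3.0 / 7.0) * 0.4   # mu(ab) = mu(a) * pi(a, b) for this chain
    print(rate_function(mu_w + 0.1, pi, "ab", "ab"))
```

The matrix has $|A|^\ell$ rows, so this dense pure-Python version is only practical for toy sizes; it mainly shows where the cost noted above comes from.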
6.5. Renewal count distribution


As a particular case of non-overlapping occurrence counts, in this section we
count renewals of a word $w = w_1 w_2 \ldots w_\ell$ in a random sequence $X_1 \cdots X_n$, as
defined in Section 6.2. We then consider the renewal count $R_n(w) = \sum_{i=1}^{n-\ell+1} I_i(w)$,
where $I_i(w)$ is the random indicator that a renewal of $w$ starts at position $i$ in
$X_1 \cdots X_n$ (see (6.2.11)).
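As a small illustration of the renewal count, the Python sketch below scans a sequence from left to right and keeps an occurrence as a renewal when it is the first occurrence or does not overlap the previous renewal. This reading of the definition, the 0-based positions and the function names are assumptions made for the example.

```python
def renewal_positions(w, seq):
    """0-based start positions of the renewals of w in seq (left-to-right scan)."""
    ell, out, last = len(w), [], None
    for i in range(len(seq) - ell + 1):
        # keep the occurrence only if it does not overlap the previous renewal
        if seq[i:i + ell] == w and (last is None or i - last >= ell):
            out.append(i)
            last = i
    return out


def renewal_count(w, seq):
    """R_n(w): the number of renewals of w in seq."""
    return len(renewal_positions(w, seq))


if __name__ == "__main__":
    print(renewal_positions("ata", "tatatatat"))
```

In the sequence tatatatat, the three overlapping occurrences of ata form a single clump but contain two renewals, which is why the renewal count and the declumped count can differ.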

Exact results for the distribution of $R_n$ have been proposed using a
combinatorial approach and language decompositions. Because those tools are very
different from the ones used in this chapter, we only present asymptotic results.
First we derive the expected renewal count.


Expected renewal count If the random indicators $I_i(w)$ had the same
expectation, say $\mu_R(w)$, then $E(R_n(w)) = (n-\ell+1)\mu_R(w)$. This is the commonly
used expectation, but it ignores the end effect. For $i > \ell$, the $I_i(w)$'s are
effectively identically distributed by stationarity of the Markov process, but it is not
the case for $1 \le i \le \ell$.

We start with the calculation of $\mu_R(w)$. Recall that $P(w)$ is the set of
periods of $w$ and that $w^{(p)} = w_1 w_2 \cdots w_p$ denotes the word composed of the
first $p$ letters of $w$. When the Markov process is in stationarity, we have from
renewal theory that


\[
\mu_R(w) = \frac{\mu(w)}{Q(1)} \tag{6.5.1}
\]


with $Q$ given in (6.2.2). To understand this formula, note that we can decompose
the event {there is an occurrence of $w$ starting at position $i$}, $i \ge \ell$, as the disjoint
union of {there is a renewal of $w$ starting at position $i$} and {there is a renewal
of $w$ starting at position $j$ directly followed by the letters $w_{\ell-i+j+1} \cdots w_\ell$ and
$i - j$ is a period of $w$}, for $j \in \{i-\ell+1, \ldots, i-1\}$. This can be written as
follows:

\[
\begin{aligned}
Y_i(w) &= \sum_{j=i-\ell+1}^{i} I_j(w)\, Y_{j+\ell}(w_{\ell-i+j+1} \cdots w_\ell)\, \mathbb{I}\{ i - j \in P(w) \cup \{0\} \} \\
       &= \sum_{p \in P(w) \cup \{0\}} I_{i-p}(w)\, Y_{i+\ell-p}(w_{\ell-p+1} \cdots w_\ell).
\end{aligned}
\]

Taking expectations on both sides thus gives

\[
\mu(w) = \sum_{p \in P(w) \cup \{0\}} \mu_R(w)\, \frac{\mu(w_{\ell-p} \cdots w_\ell)}{\mu(w_{\ell-p})}.
\]

Hence

\[
\mu_R(w) = \frac{\mu(w)}{1 + \sum_{p \in P(w)} \pi(w_{\ell-p}, w_{\ell-p+1}) \cdots \pi(w_{\ell-1}, w_\ell)},
\]

which gives the result (6.5.1).
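The closed form for $\mu_R(w)$ derived above is straightforward to evaluate. In the Python sketch below, the stationary distribution `mu` and the transition probabilities `pi` are passed in as dictionaries; this data layout and the function names are assumptions made for illustration only.

```python
def periods(w):
    """Periods of w: all p in 1..len(w)-1 with w[i] == w[i+p] wherever both exist."""
    n = len(w)
    return [p for p in range(1, n) if all(w[i] == w[i + p] for i in range(n - p))]


def word_prob(w, mu, pi):
    """mu(w) = mu(w_1) * pi(w_1, w_2) * ... * pi(w_{l-1}, w_l) in the model M1."""
    prob = mu[w[0]]
    for a, b in zip(w, w[1:]):
        prob *= pi[a][b]
    return prob


def renewal_prob(w, mu, pi):
    """mu_R(w) = mu(w) / (1 + sum over periods p of the product
    pi(w_{l-p}, w_{l-p+1}) * ... * pi(w_{l-1}, w_l))."""
    ell = len(w)
    denom = 1.0
    for p in periods(w):
        tail = 1.0
        # pairs (w_{l-p}, w_{l-p+1}), ..., (w_{l-1}, w_l) in 1-based indexing
        for a, b in zip(w[ell - p - 1:], w[ell - p:]):
            tail *= pi[a][b]
        denom += tail
    return word_prob(w, mu, pi) / denom


if __name__ == "__main__":
    # Fair-coin example (independent uniform letters), chosen for illustration.
    mu = {"a": 0.5, "b": 0.5}
    pi = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
    print(renewal_prob("aa", mu, pi), renewal_prob("aba", mu, pi))
```

For independent fair letters the values for aa and aba are 1/6 and 1/10, the reciprocals of the classical mean waiting times 6 and 10 for these two patterns.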

As previously noted, the first variables $I_1(w), \ldots, I_\ell(w)$ are not identically
distributed because of boundary effects. For the asymptotic results of interest
in this section, this end effect may be ignored.


6.5.1. Gaussian approximation 


Once the asymptotic variance is established, the normal approximation follows 
from the Markov Renewal Central Limit Theorem. Calculating the asymptotic 
variance is a little more involved than calculating the mean, relying much on 
the autocorrelation polynomial. To this purpose, we define 1 as the Card(A) x 
Card(A) matrix where all the entries equal 1. With II denoting the Markovian 
transition matrix, put 


Z= yi =i). (6.5.2) 


Ol) ple) 


We then have the following Central Limit Theorem. 


o” = p(w) (a — 20) 4220 5 2Altwe, wi) =) 


THEOREM 6.5.1. We have that, as n — oo, 


R,(w) Seat ”, N (0,0). 


The main technique to prove this theorem being generating functions, no 
bound on the rate of convergence is obtained. Note also that we do not have a 
corresponding result when mean and standard deviation are estimated. 


6.5.2. Poisson approximation 


As with the declumped count, we can also derive a Poisson approximation
for the renewal count under the rare word condition $n\mu(w) = O(1)$. Indeed
this is very simple. Recall (6.2.3):

\[
Y_i(w) := \mathbb{I}\{ w \text{ starts at position } i \text{ in } X \}.
\]

We can write, for $i > \ell$,

\[
\begin{aligned}
I_i(w) &= Y_i(w) \prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) \\
       &= Y_i(w) \prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w))
          + Y_i(w) \Big( \prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) - \prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w)) \Big) \\
       &= \tilde Y_i(w) + Y_i(w) \Big( \prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) - \prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w)) \Big),
\end{aligned} \tag{6.5.3}
\]

whereas $I_i(w) = Y_i(w) \prod_{j=1}^{i-1} (1 - Y_j(w))$ if $1 \le i \le \ell$. Note that a renewal
occurrence in the first $\ell$ positions is a clump occurrence observed in the finite
sequence, and conversely. Thus we have

\[
\begin{aligned}
R_n(w) &= \sum_{i=1}^{n-\ell+1} I_i(w) \\
       &= \tilde N(w) + \sum_{i=\ell+1}^{n-\ell+1} Y_i(w) \Big( \prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) - \prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w)) \Big).
\end{aligned}
\]

We have already derived a Poisson approximation for the number of clumps
$\tilde N(w)$ (see Section 6.4.5). Let us consider the difference

\[
R_n(w) - \tilde N(w) = \sum_{i=\ell+1}^{n-\ell+1} Y_i(w) \Big( \prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) - \prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w)) \Big).
\]


For a summand to be nonzero, firstly we need that $Y_i(w) = 1$. Note that a
renewal always implies an occurrence, so that

\[
\prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) \ge \prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w)).
\]

Each product being always 0 or 1, the two products are different if and only if
$\prod_{j=i-\ell+1}^{i-1} (1 - I_j(w)) = 1$ and $\prod_{j=i-\ell+1}^{i-1} (1 - Y_j(w)) = 0$. This implies that there
is no renewal between the positions $i-\ell+1$ and $i-1$, but that there must be an
occurrence not only at position $i$ but also at some position $j$ between $i-\ell+1$
and $i-1$. This occurrence again cannot be a renewal, so that it must be part
of a larger clump; repeating this argument we see that the occurrence at $i$ must
be part of a clump that started before position $i-\ell+1$. This implies that there
had to be an occurrence of $w$ somewhere between $i-2\ell+2$ and $i-\ell$, and this
occurrence is in the same clump as the occurrence at $i$. Thus

\[
\begin{aligned}
P\big( R_n(w) \ne \tilde N(w) \big) &\le \sum_{i=\ell+1}^{n-\ell+1} \sum_{j=i-2\ell+2}^{i-\ell} E\big( Y_i(w)\, Y_j(w) \big) \\
 &\le (n-2\ell+1)(\ell-1)\, \frac{\mu(w)^2}{\mu(w_1)}.
\end{aligned}
\]

This quantity will be small under the asymptotic framework $n\mu(w) = O(1)$.
Thus we may use the Poisson bound for the number of clumps derived above,
and just add an error term of order $\log n / n$.


A different type of bound is also available. Put 


(t) 
PST Ssug-— (we, w1) 


ue tin) (6.5.5) 


Recall p given in (6.1.5), and E(R,,(w)) is given in (6.5.1). Using the Chen-Stein 
method, it is possible to prove the following theorem (see Section 6.9). 


THEOREM 6.5.2. We have that 


dry (L(Rn(w)), Po(E(Rn(w))) < (1 = eB) Di 


where 


Dy = (20-5) (ji(w) + Pp(w)) — P (26 — 1)u(w), 
Dz = 2E(Rn(w))p* (2+ 2p° + p***), 


Da= (1 + min {1, (E(R,(w))) 72 \) (E(R,,(w)) — B(N(w))). 


It is also of interest to consider the case that $n \to \infty$, for a sequence of words
$w^{(n)}$ of length $\ell^{(n)}$, where $\ell^{(n)}$ may grow with $n$. Indeed, under the conditions

(i) $\lim_{n \to \infty} E(R_n(w^{(n)})) = \lambda < \infty$

(ii) $\lim_{n \to \infty} \frac{\ell^{(n)}}{n} = 0$,

the bound in Theorem 6.5.2 is of order $O\big( \frac{\ell^{(n)}}{n} \big)$, which converges to zero for
$n \to \infty$. Thus $R_n(w^{(n)})$ converges in distribution to a Poisson variable with
mean $\lambda$.


6.6. Occurrences and counts of multiple patterns 


In biological sequence analysis often the distribution of the joint occurrences of 
multiple patterns rather than that of single words is of relevance, for example 
when characterizing protein families via short motifs, or when assessing the 
statistical significance of the count of degenerated words such as a(c or g)g(a 
or t), describing the family of words {acga, agga, acgt, aggt}. 

Since the exact distribution of the counts of multiple words is not easily 
calculated in practice, we will focus in this section on the asymptotic point of 
view. 
Indeed, asymptotic results, similar to the above approximations, are
available for the distribution of joint occurrences and joint counts of multiple patterns
and we will present them in this section. As we will see, the main new feature
one has to consider is the possible overlaps between different words from the
target family.

Consider the family of $q$ words $\{w^1, \ldots, w^q\}$, where $w^r = w^r_1 w^r_2 \cdots w^r_{\ell_r}$. For
two words $w^1 = w^1_1 w^1_2 \cdots w^1_{\ell_1}$ and $w^2 = w^2_1 w^2_2 \cdots w^2_{\ell_2}$ on $A$, we describe the
possible overlaps between $w^1$ and $w^2$ by defining

\[
P(w^1, w^2) := \{ p \in \{1, \ldots, \ell_1 - 1\} : w^1_i = w^2_{i-p},\ \forall i = p+1, \ldots, \min(\ell_1, \ell_2 + p) \}.
\]

Thus $P(w^1, w^2) \ne \emptyset$ means that an occurrence of $w^2$ can overlap an occurrence
of $w^1$ from the right, and $P(w^2, w^1) \ne \emptyset$ means that $w^2$ can overlap $w^1$ from
the left. Note the lack of symmetry; for example, if $w^1$ = aaagaagaa and
$w^2$ = aagaatca, we have $P(w^1, w^2) = \{4, 7, 8\}$ and $P(w^2, w^1) = \{7\}$. To avoid
trivialities, we make the following assumption.


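(A1) $\forall r \ne r'$, $w^r$ is not a substring of $w^{r'}$.


Thus $\{w^1, \ldots, w^q\}$ is a reduced set of words. Again we model the sequence
$\{X_i\}_{i \in \mathbb{Z}}$ as a stationary ergodic Markov chain.
We introduce the notation

\[
\ell = \max_{1 \le r \le q} \ell_r, \tag{6.6.1}
\]
\[
\ell_{\min} = \min_{1 \le r \le q} \ell_r.
\]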
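The overlap sets between two words, and assumption (A1), can be computed directly from their definitions. The following Python sketch is illustrative: the function names `overlap_set` and `is_reduced` are hypothetical, words are plain strings, and shifts are reported as in the text (1-based $p$).

```python
def overlap_set(w1, w2):
    """P(w1, w2): shifts p in 1..len(w1)-1 such that a copy of w2 starting p
    positions after the start of w1 agrees with w1 wherever the two overlap."""
    out = []
    for p in range(1, len(w1)):
        end = min(len(w1), len(w2) + p)   # end of the overlap, in w1 coordinates
        if all(w1[i] == w2[i - p] for i in range(p, end)):
            out.append(p)
    return out


def is_reduced(words):
    """Assumption (A1): no word of the family is a substring of another one."""
    return not any(r != s and words[r] in words[s]
                   for r in range(len(words)) for s in range(len(words)))


if __name__ == "__main__":
    w1, w2 = "aaagaagaa", "aagaatca"
    print(overlap_set(w1, w2), overlap_set(w2, w1))
    print(is_reduced(["acga", "agga", "acgt", "aggt"]))
```

Calling `overlap_set(w, w)` returns the set of periods of a single word, so the same helper covers both the single-pattern and the multiple-pattern case.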


6.6.1. Gaussian approximation for the joint distribution of multiple 
word counts 


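We assume the general model Mm, $m \le \ell_{\min} - 2$. We will show the asymptotic
normality of the vector $n^{-1/2} \big( N(w^r) - \hat N_m(w^r) \big)_{r=1,\ldots,q}$:

\[
\frac{1}{\sqrt{n}} \big( N(w^r) - \hat N_m(w^r) \big)_{r=1,\ldots,q} \xrightarrow{\ \mathcal{D}\ } \mathcal{N}(0, \Sigma_m).
\]

To prove this result, we use a multivariate martingale central limit theorem.
The estimated count $\hat N_m(w^r)$ is given by (6.4.2). The novelty consists here of
deriving the asymptotic covariance matrix $\Sigma_m = \big( \Sigma_m(w^r, w^{r'}) \big)_{1 \le r, r' \le q}$.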

Suppose all the words w” have the same length @ and m = ¢—2 (the maximal 
model) then the martingale technique (see Section 6.4.3) leads to 


De-2(w",w") = p(w") u(w" ) ee _ He) = wy 


p(w") p((wr)~) 
Ww) = (wt (w") =" (w")7} 
u(~(w")) u(~(w")~) 


Note that when $r = r'$, this formula reduces to the asymptotic variance $\sigma^2_{\ell-2}(w^r)$
of Section 6.4.3.
More generally, for $r \ne r'$, the conditional approach (see Section 6.4.3) leads
to

Suu") = Sw (uryur’) + Sw (ur yOw") 


pEP(w",w™ ) pEP(w™ ,w”) 
plp—m—-1 pxl,r—m—1 


Lay +++ Gm) 


+1(w" )u(w") ( > Rlar:ame)n/(ar + ame) 


yy Measstmss)o"(aa sams) _ mul ae) 
ans) wwe wh) 


Tw +: Why = wh wh} = 2! (wh wine) 
p(w +++ Wh) 


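where $n(\cdot)$ denotes the number of occurrences inside $w^r$ and $n'(\cdot)$ denotes the
number of occurrences inside $w^{r'}$. (When $r = r'$, the formula reduces to
Equation (6.4.7).)

Note that, if one wants to study the total number of occurrences of a word
family $\{w^r,\ r = 1, \ldots, q\}$, we have

\[
\frac{1}{\sqrt{n}} \Big( \sum_{r=1}^{q} N(w^r) - \sum_{r=1}^{q} \hat N_m(w^r) \Big)
  \xrightarrow{\ \mathcal{D}\ } \mathcal{N}\Big( 0,\ \sum_{1 \le r, r' \le q} \Sigma_m(w^r, w^{r'}) \Big).
\]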


Error bound for the normal approximation Similarly to Theorem 6.4.4, 
it is possible to give a bound on the approximation when the parameters do not 
have to be estimated. Let w = {w',...,w%} be the word set and 


N(w) = (N(w’),..., N(w%)) 
be the vector of word counts. Denote its covariance matrix by 


Ly = Ln(w) = Cov(N(w)) = (Cov(N(w'), N(w4)) 


i,j=1,...q° 


A calculation similar to (6.4.1) shows that, for two different words u and v of 
length ¢,, and @, such that u is not a substring of v and v is not a substring of 
U: 


Cov(N(u), N(v)) = 
Yo E(N(u2)) + SO E(N(v u)) — E(u) E(N(0)) 


pEP(u,v) peP(v,u) 
+ p(w) se St en ey ee T4(ue,,v1)  T4(ve,, v1) 
U v n— bly — by _ we Mba Ua oe Meas MA) | 
MU) 2. CT mon 


In particular, £,, is invertible. 
Some more notation is needed. Let H denote the collection of convex sets 
in R?, and let 


B= B(w) = foes ale!) 


Recall, from the transition matrix diagonalization, a given in (6.1.1) and Q; 


given in (6.1.2). Using Theorem 6.8.1, it is possible to derive the following 
result. 


THEOREM 6.6.1. Assume the Markov model M1. Let Z ~ N(E(N(w), Ln). 
There are constants c and C,C2,C3 such that, for any set w of gq words with 
maximal length ¢, 


sup |P(N(w) € A) —P(Ze A)| <c min Bg, 
ACH fSs<F 


where 


i 
2 


B, = 2q? (4s — 3) (£51 


+2q'n(2s—1)(49—8) (¢ lent) (Itos(o? yen") + 108) 


AW or Jon? Bale 


+C2 (Inoe(o? ea Dl +logn } (2s — Hilal? 


ue ( log(a?/ |£n'|)| + logn } (n — 28+ Ingis? [£1] al. 


Here, 

C;, = max S- p(a)C1,1(a, 6), > Ci,2(a), S- Ci,3(a) 

a,beEA acA acA 
C2 = n|Ly"|q*B(26 + 1)C, 
C3 = max |Qe(a, 6)| 
abed | (a) 
and 
3 ( b, 
C1,1(a,b) = max, > 1Q:(a, 2)Qu(, Wl fl a lQi(a, b)| 
TYE | 150 OF t>2 H(2) 1>2 


C1a(a) = max 4 S726, 4) 


t>2 
The constant c is not easy to describe. Note that convergence on the class 
of convex sets is not as strong as convergence in total variation. Indeed, ap- 
proximating discrete counts by a continuous multivariate variable might not be 
expected to be very good in total variation distance. 


6.6.2. Poisson and compound Poisson approximations for the joint 
distribution of declumped counts and multiple word counts 


We assume the model M1 since generalization to Mm follows the single pattern 
case. To give a bound on the error for a Poisson process approximation for 
overlapping counts, define the following quantities for all r and r’ in {1,...,q}, 
and for alla € A: 


3£-—£,—2 


= SH, 
s=1 


ly, +£,7—2 


r 


Qyr = > IL’, 
s=l 


1 ; j 
as S- urea) Eren, 


M(w",w" ) peP(wr wr’) 
0 ifr=r’, 
eae wy pty {Det (Wh,, WT) ee 
Ti(w",w" ) = (2n — bl, — bp + 2)u(w") p(w" ) ey sw") |, 
1 


Taw" uw") = (n= 6. 41)((E= 1) Gu" Yalu") + lw" aw") 
+(60—S)fitw" lw") 


1 ( Qnp(wh wt) — Aner (wh, wh) 
T3(w", w" ) = (n = Ci + 1) p(w") p(w" ) a + eae 


u(wr’ ) pwr) 
eee aw" iw") (6.6.2) 
(n=, +1)(C-2) 


PEE? (tur tw") + nw" ylw")) 


Lmin 


+(n — lp + 1)u(w")u(w” ) (r(w", w")+ M(w" ,w")) ; 


1 Oy Oy 
v1 (Lr, £, a) a yy (2) Wek (bd) S- 7; — Q(z, 6)Qz(a, y) 
x,YyEA (t,t’)A(1,1) 
Al 4e—2 
a 
-»— : 7] Qi(x y) ’ 
t=2 
1 ater g2l- er 
qa(lr,£) = S> plz) max }|—~ Sl [44 —Qi(2,0)Qu(a,y) 
a,beA \ p(b) a 
x,yeA (t,t A(1,1) 
IAl | 5e—3 
+0 : L Qi(z,y) 
t=2 


Here we choose as index set I = {1,2,...,q(n+1)— °7_, és}; it can be 


written as the disjoint union I = *_, I, with 
r—1 
n={o 1)(nt+1)— 504, 41,...,7(n+1)- -Yot \. (6.6.3) 
s=1 
We define [7] by 
r—1 
[i] :=4-(r-1)(n+1)+5 0%, with r=r(i) such that i€I,. (6.6.4) 
s=1 


Joint distribution of declumped counts To apply Theorem 6.8.2, the 
Bernoulli process Y = (Y;)ier and the Poisson process Z = (Zj);er are given 
by 

Yi= Yigq(w"), 

Zi ~ Po(pi(w")), (6.6.5) 
where 7 is such that i € I. For 7 € I, we choose the neighborhood B; := {7 € 
I: |[3] — [é]| < 32-3}. 

Then the following results can be proven. Recall the notation (6.6.2), (6.6.5), 
and (6.6.1). 


THEOREM 6.6.2. Under assumption (A1) we have 


dry (LY), £(Z)) 


2 
< (w= hen 1)(60—5) (Sse 9) + S> T(w",w") 


l<r,r’<q 


qd 
+lal* Soler, £, we, )(n — br + Lji(w"). 


ea 


COROLLARY 6.6.3. Let $(Z_r)_{r=1,\ldots,q}$ be independent Poisson variables with $E(Z_r) = (n - \ell_r + 1)\mu(w^r)$. With the previous notation and under assumption (A1), we have 
2 
< (nm — Lin + 1)(62 —5) (Soa y S> Ti(w",w") 


1<r -_ 
+lalé > teat ine ie +L —1)(u(w") — fi(w")) . 
r=1 
The proof is a direct application of Theorem 6.8.2, similar to that in Section 6.4. 


Distribution of multiple word counts In a similar way a compound Poisson approximation for the numbers of occurrences can be obtained. Choose as index set

$$I = \Big\{1, \ldots, q(n+1) - \sum_{s=1}^{q} \ell_s\Big\} \times \{1, 2, \ldots\}.$$

To apply Theorem 6.8.2, the Bernoulli process $Y = (Y_{i,k})_{(i,k) \in I}$ and the Poisson process $Z = (Z_{i,k})_{(i,k) \in I}$ are now defined as

$$Y_{i,k} = \tilde Y_{[i],k}(w^r), \qquad Z_{i,k} \sim \mathrm{Po}(\tilde\mu_k(w^r)),$$

where $r = r(i)$ is such that $i \in I_r$; $I_r$ and $[i]$ are given by (6.6.3) and (6.6.4).

For $(i,k) \in I$, the neighborhood is still $B_{i,k} := \{(j,k') \in I : -(k'+3)(\ell-1) \le [j] - [i] \le (k+3)(\ell-1)\}$.


We make the following weak assumption on the overlap structure. 
(A2) $\forall r \ne r'$, $w^r$ is not a substring of any composed word in $\mathcal{C}_2(w^{r'})$. 


THEOREM 6.6.4. Under assumptions (A1), (A2) and with the notation (6.6.2), 
we have 


dry (C@).LZ)) < Y Pa(w' sw") + SO Blw",w") 


l<r,r’<q 1srr'<q 


+lal® S 7 y2(6-,€)(n — br + 1)fi(w"). 


r=1. 
The following corollary is easily obtained. 
COROLLARY 6.6.5. Let $(Z_k)_{k \ge 1}$ be independent Poisson variables with expectation $E(Z_k) = \sum_{r=1}^{q} (n - \ell_r + 1)\tilde\mu_k(w^r)$; CP denotes the (compound Poisson) distribution of $\sum_{k \ge 1} k Z_k$. With the notation (6.6.2) and under assumptions (A1), (A2), we have 


trv (e (Spe) cr) < o> To(w",w") + S- T3(w",w" ) 


lsr,r'<q lsrr'<q 


+lal? >) 2(ér, (nm — br + 1)(w") 


-2 (lr — 1) (filw") — p(w") - 


r=1 


Again, empirically the compound Poisson approximation may perform better 
than the bound suggests, in the case of not so rare words. 
Expected count of mixed clumps For the family $w = (w^1, \ldots, w^q)$ of words it is also interesting to consider the number of mixed clumps of occurrences. Let

$$\tilde Y^c_i(w) = \sum_{r=1}^{q} Y_i(w^r) \prod_{r'=1}^{q} \prod_{j=i-\ell_{r'}+1}^{i-1} \big(1 - Y_j(w^{r'})\big),$$

that is, $\tilde Y^c_i(w) = 1$ if there is an occurrence of a word from the family $w$ at $i$, and if there is no previous occurrence of any word in $w$ that overlaps position $i$. Thus the mixed clumps can be composed of any words from $w$, whereas for $\tilde Y_i$ the clumps are composed of the same word. Note that, for $q = 1$, $\tilde Y^c_i(w) = \tilde Y_i(w^1)$. Let

$$\tilde N^c(w) = \sum_{i} \tilde Y^c_i(w)$$

be the number of mixed clumps in the sequence. To calculate $E(\tilde N^c(w))$, introduce the quantities 


3 e (Oy, dteaiss 


u(wy, ) 


ers = ’ 


pEP (wwe) 
where the summands are the probabilities of observing the last $(\ell_s - \ell_r + p)$ letters of $w^s$ successively given that the last letter of $w^r$ has just occurred. It can be shown that

$$E(\tilde N^c) = (n - \ell + 1) \sum_{r=1}^{q} y_r, \qquad (6.6.6)$$

where $(y_1, \ldots, y_q)$ is the solution of the $q \times q$ linear system of equations

$$\sum_{r=1}^{q} e_{r,s}\, y_r = \mu(w^s), \qquad s = 1, \ldots, q.$$




6.6.3. Competing renewal counts 


Results related to the above are available for renewal counts. We consider non-overlapping occurrences in competition with each other. For example, in the sequence cgtatattaaaaatattaga, the set of words tat, tta and aa has renewal occurrences of tat at positions 3 and 14, of tta at position 7, and of aa at positions 10 and 12. The occurrences of tat at position 5, of tta at position 16, and of aa at positions 9 and 11 are not counted because they overlap with some already counted words. 
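The example above can be reproduced with a short left-to-right scan. The following Python sketch is an illustration only: the function name `competing_renewals` is hypothetical, and the sketch assumes that an occurrence is counted as a renewal exactly when it does not overlap the previously counted occurrence. It also assumes that no word of the family is a prefix of another; if two words start at the same position, the first one in the list is taken.

```python
def competing_renewals(seq, words):
    """Return, for each word, the 1-based start positions of its competing renewals.

    The sequence is scanned from left to right; an occurrence is counted
    only if it starts after the end of the last counted occurrence.
    """
    counted = {w: [] for w in words}
    next_free = 0  # 0-based index of the first position not yet covered
    for i in range(len(seq)):
        if i < next_free:
            continue  # position i is covered by an already counted word
        for w in words:
            if seq.startswith(w, i):
                counted[w].append(i + 1)  # report 1-based positions
                next_free = i + len(w)
                break
    return counted


if __name__ == "__main__":
    print(competing_renewals("cgtatattaaaaatattaga", ["tat", "tta", "aa"]))
```

On the sequence of the example, the scan returns positions 3 and 14 for tat, 7 for tta, and 10 and 12 for aa, in agreement with the text.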
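Let

$$I^c_i(w^r) = \mathbb{1}\{\text{a competing renewal of } w^r \text{ starts at position } i \text{ in } X_1 \cdots X_n\},$$

and let

$$R^c_n(w^r) = \sum_{i=1}^{n-\ell_r+1} I^c_i(w^r)$$

be the number of competing renewals of $w^r$ in the sequence $X_1 X_2 \cdots X_n$. 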

For the mean $\mu^c(w^r) = E(R^c_n(w^r))$, some more notation is needed. For a matrix $A$ denote its transposed matrix by $A^T$, and, if $A$ is a square matrix, $\mathrm{Diag}(A)$ represents the vector of the diagonal elements of $A$. Define the probabilities of ending a word for $1 \le j \le \ell_r - 1$ as

$$P_r(j) = P(\text{collect final } j \text{ letters of } w^r \mid \text{start with correct } \ell_r - j \text{ initial letters of } w^r) = \frac{\mu(w^r)}{\mu(w^r_1 \cdots w^r_{\ell_r - j})}.$$

Then, in analogy to (6.2.2), the correlation polynomials are defined as 


Q,.r/(Z) =1+ oa 2PP,.(p). 


pEP(wr wr’) 


Define the q x q matrix 


and 


Moreover put $K_r = \mu(w^r_1) P_r(\ell_r - 1)$ and define the vector

$$K = (K_1, \ldots, K_q)^T.$$

Then the means $\mu^c(w^r)$, $r = 1, \ldots, q$, are given by

$$(\mu^c(w^1), \ldots, \mu^c(w^q))^T = AK.$$
Gaussian approximation for the joint distribution of competing renewal counts The main problem in the multivariate normal approximation 
is to specify the covariance structure. To state the result, quite a bit of notation 
is needed. Define 


K,(2) =z" P,(e,— 1) 
and the vector i 7 7 

K(z) = (Ki(z),...,Kq(z))”. 
Denote by 


Diag(K(z)) 


the q x q diagonal matrix with the components of K(z) as diagonal elements. 
Put 


K = K(1) 
Hg) = £A(2) 
H = H(1). 


Define the vector 
D= (,K),...,€,Kq), 


and the matrix 
Z= Ziy); 


where Z is defined in (6.5.2), and for a matrix A the matrix Ajy is the q x q 
matrix whose (r,7r’) entry is the element of A at the row corresponding to the 
last letter $w^r_{\ell_r}$ of the word $w^r$, and at the column corresponding to the first letter $w^{r'}_1$ of $w^{r'}$. Define the variance-covariance matrix 


1 
C= g(AK(AK —2HK —2AL)\? + (AK -2HK 2AL)(AK)") 
+Diag(AK)ZDiag(K)A? + ADiag(K)Z? Diag(AK) + Diag(AK). 
Now we have all the ingredients to state the normal approximation. 


THEOREM 6.6.6. Under Assumption (A1) we have 


ae =n) >, N(0,0). 


In the case of a single pattern, this theorem reduces to Theorem 6.5.1. 


Poisson approximation for the renewal count distribution For a Poisson approximation, the problem can be reduced to declumped counts, as in the case of a single word. For a Poisson process approximation (and, following from that, a Poisson approximation for the counts), we want to assess $P(I^c_i(w^r) \ne \tilde Y_i(w^r))$. First consider $P(I^c_i(w^r) = 1, \tilde Y_i(w^r) = 0)$. Note that, from (6.5.3), for $i > \ell_r$, to have $I^c_i(w^r) = 1$, $\tilde Y_i(w^r) = 0$, there must be an occurrence of $w^r$ at position $i$, and this occurrence cannot be the start of a clump of $w^r$, so that there must be an overlapping occurrence of $w^r$ at some position $j = i - \ell_r + 1, \ldots, i - 1$. Moreover, this occurrence cannot be a competing renewal, so there must be another word $w^{r'}$ overlapping this occurrence. Hence we may bound 


P(If(w") = 1, ¥;(w") = 0) 


<P Ww) Seams Wee Mw" w"), 


pEP(w") r=l 


with $M$ given in (6.6.2). For $i \le \ell_r$ the above bound is still valid (the probability is even smaller since there is not always enough space for these clumps to occur). Secondly, consider $P(I^c_i(w^r) = 0, \tilde Y_i(w^r) = 1)$. For $I^c_i(w^r) = 0$, $\tilde Y_i(w^r) = 1$ to occur, there must be an occurrence of $w^r$ at position $i$, overlapped by an occurrence of a different word $w^{r'}$, so that we may bound 


P(IS(w") = 0, ¥4(w") = 1) < pw") D> p(w" )M(w"",w"). 


pEP(w") 
Hence 
q q 
PISAY) < SY (n- 6 +1)u(w") S> uw" )M(w" ,w") 
r=l1 Trail 
1 


Thus we obtain as a corollary of Theorem 6.6.2 
COROLLARY 6.6.7. Under assumption (A1) and with the notation (6.6.2) and 
(6.6.5), we have 


dry (L(IS), L£(Z)) < (n _ Lain + 1) (6£ — 5) (SH p(w n) + ye Ty (w", w" ) 


l<r,r’<q 


qd 
+ lal’ So (6-8, we, )(n — & + 1)fi(w") 
r=] 
q qd / 


+ o(n= b+ Dalw") S> ww" )M(w"’, w") 


r=1 ‘gel 


1 
THe) 2 ae 


pEP(w") 


Note that the order of the approximation is the same as in Theorem 6.6.2; the additional error terms are comparable to $T_1$ and $T_2$, respectively. A Poisson approximation for the competing renewal counts follows immediately. 


Poisson approximation for competing renewal counts As an alternative to the above approach, a Poisson approximation similar to Theorem 6.5.2 for the number of competing renewals can be derived. Recall $E(\tilde N^c(w))$ from (6.6.6), and $\Gamma$ from (6.5.5). 


THEOREM 6.6.8. We have that 


dry (eo: Ri,(w")), Po (>: B1R(u") 
< (1 = ae) D, + min fs aw} Dz + Ds, 


where 


D1 = (26-5) (Bos) 1 ry ae") —T(2lmin = 1) > n(w"), 


y= 


Dy = 2E(N°(w))p! (2+ 2° + p¥*~*) , 
Dy = (1-+min {1, CE(W"(w))#}) (seer (R§ (w")) — E(N(w ») | 


It is again interesting to consider the case that $n \to \infty$, for a sequence of words $w^{(n)} = (w^{1,(n)}, \ldots, w^{q,(n)})$ of maximal length $\ell^{(n)}$, where $\ell^{(n)}$ may grow with $n$. It is possible to show that under the conditions

(i) $\lim_{n \to \infty} \sum_{r=1}^{q} E(R^c_n(w^{r,(n)})) = \lambda < \infty$,

(ii) $\lim_{n \to \infty} \ell^{(n)}/n = 0$,

the bound in Theorem 6.5.2 is of order $O(\ell^{(n)}/n)$, so that $\sum_{r=1}^{q} R^c_n(w^{r,(n)})$ converges in distribution to a Poisson variable with mean $\lambda$. 


6.7. Some applications to DNA sequences 


6.7.1. Detecting exceptional words in DNA sequences 


We call a word $w$ exceptional if it appears in an observed sequence with a significantly high or low frequency. This significance is measured under a given probabilistic model by the p-value $P(N(w) \ge N_{obs}(w))$ using the distribution of the count $N(w)$. Depending on the sequence length and on the expected count of the word, it is often not realistic to use the exact distribution of the count, since it is time consuming to calculate. In this section, we will first give some elements of comparison between the p-values obtained using the exact distribution (Section 6.4.1) and the ones obtained using the Gaussian approximation (Section 6.4.3), the compound Poisson approximation (Section 6.4.5), or the large deviation techniques (Section 6.4.6). For convenience, we will manipulate scores from $\mathbb{R}$ of the form $\Phi^{-1}(\text{p-value})$ rather than the p-values, where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution (probit normalization). Exceptionally frequent words would then have high positive scores whereas exceptionally rare words would have high negative scores. 


Quality of the approximate p-values For each word of length 3, 6 and 9 of the complete genome of the Lambda phage ($\ell = 48\,502$), we can compare the exact scores under the Bernoulli model M0 with the approximate ones using either the Gaussian approximation or the compound Poisson distribution (the parameters are assumed to be known). The results are presented in Figure 6.1 together with the approximate scores obtained with the large deviation approach: the x-axis of each plot is for the exact scores of 3-words (first row), 6-words (second row) and 9-words (last row). The y-axis is for the scores approximated with the Gaussian approximation (first column), the compound Poisson distribution (second column) and the large deviation approach (last column). Due to numerical errors the exact scores of 5 words of length 3 have not been calculated successfully. We observe that the accuracy of the Gaussian approximation decreases as the length of the words increases (rare words). The compound Poisson approximation is surprisingly satisfactory even for short (frequent) words. This agrees with the evolution of the total variation distance between the exact distribution of the count and both approximate distributions; when the expected count of the word is close to 100 or greater, the accuracy of the Gaussian approximation is very good. The large deviation approach also seems to provide a good approximation for the exceptional words. However, it cannot cope with words whose estimated expected count is too close to the observed one. 
Figure 6.1. Normalized p-values of the counts of all the words of length 3, 6 and 9 in the genome of the Lambda phage ($\ell = 48\,502$). Comparison of the Gaussian, the compound Poisson and the large deviation approximations (y-axis) with the exact scores (x-axis). 


The p-value is then set to 1/2 in this case and the flatness of the curves is an 
artifact. An important feature is that every method to calculate or to approxi- 
mate the p-values seems to classify the words in the same way; the score ranks 
are almost the same. Moreover, in this example, the three methods agree on 
the fact that there are no exceptionally rare words of length 9. 


Influence of the model Whatever the word count distribution used to cal- 
culate the normalized p-value, the choice of the model, in particular the order 
m of the Markov chain, is important to interpret the exceptionality of a given 
word. Using the model Mm means taking into account the 1- to (m+ 1)-letter 
word composition of the sequence. Therefore, the greater the order m of the 
model, the closer the random sequences will be to the observed sequence, and 
fewer unexpected words will be found. As an example, Figure 6.2 shows the 


discrepancy of the scores for the 8-letter words in the complete genome of E. 
coli ($\ell = 4\,638\,858$) under models M0 to M6. For each model, the box contains 
half of the 65536 scores, the horizontal line is drawn through the box at the 
median of the data, the upper and lower ends of the box are at upper and lower 
quartiles (25% and 75%) and vertical lines go up and down from the box to the 
extremes of the data, except for the outliers, which are plotted by themselves. 
Here the outliers are the scores that are separated from the box by at least 3 
times the inter-quartile range (height of the box). In models M7 and higher, 
all the 8-letter words have a null score since their counts are included in these 
models: they are expected as they occur. M6 is then the maximal model for 
words of length 8. 


Figure 6.2. Boxplots of the 8-letter word scores in the complete genome 
of E. coli under models M0 to M7. 


To analyze the frequency of an $\ell$-letter word, the maximal model is of order $m = \ell - 2$; in this model the exceptionality of a word of length $\ell$ cannot be explained by an unexpected sub-word, since all the sub-word frequencies are included in the maximal model. On the contrary, in small models such contamination by exceptional sub-words may occur. As an illustration let us consider the following example: Figure 6.3 compares the scores (using the Gaussian approximation) of all the 256 4-letter words in the complete Lambda genome under the models M1 (x-axis) and M2 (y-axis). The most over-represented 4-word under M1 is ccgg, and it remains significantly over-represented under M2 while taking into account the counts of ccg and cgg. However, many words lose their exceptionality when the order of the model increases. For example, gctg loses its exceptionality as soon as one takes into account the fact that ctg occurs 1169 times and is thus a significantly frequent 3-letter word (see Table 6.2).

Figure 6.3. Scores of the 4-letter words in the Lambda genome under M1 (x-axis) and M2 (y-axis).

The model M2 says that the 406 occurrences of gctg are expected according to the 3-letter word composition of the sequence: gctg is expected 394.6 times under M2 (see Table 6.1). Its exceptionality under M1 (expected only 255 times) is an artifact due to the important over-representation of its sub-word ctg. The number of times that we see gctg is not surprising given the number of occurrences of ctg. This is what we call a contamination. Another such example is ctag: it is exceptionally rare under M1 but not under M2. On the other hand, some exceptionality may be hidden in small models and be revealed in higher models, leading to very interesting interpretations. As an example, ccat is not exceptional under M1 and becomes one of the most over-represented words under M2. If we look at its two sub-words of length 3, cca and cat are slightly under-represented (see Table 6.2). Given their low frequency, ccat is expected only about 169 times under M2 (see Table 6.1), which is significantly less than the 218 observed occurrences. So, cca and cat are slightly avoided in the sequence but they are preferentially overlapping in the sequence. This is more pronounced for tagt, which is composed of the most avoided 3-word tag and is declared under-represented under M1 (contamination in fact), but it seems that there is an important constraint for these occurrences of tag to be followed by a t. 
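The way a higher-order model absorbs sub-word counts can be made concrete with a plug-in estimate of the expected count. The sketch below is an illustration under stated assumptions: it uses the usual plug-in estimator for a Markov model of order $m$ (product of the counts of the $(m+1)$-letter sub-words divided by the product of the counts of the interior $m$-letter sub-words), which for $m = \ell - 2$ reduces to $N(w_1 \cdots w_{\ell-1}) N(w_2 \cdots w_\ell) / N(w_2 \cdots w_{\ell-1})$. The function names are hypothetical and the toy sequence is not taken from the text.

```python
import math


def count(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))


def expected_count(seq, word, m):
    """Plug-in estimate of the expected count of word under a model of order m.

    Assumes 1 <= m <= len(word) - 2. The numerator multiplies the observed
    counts of all (m+1)-letter sub-words of word, the denominator those of
    the interior m-letter sub-words.
    """
    ell = len(word)
    if not 1 <= m <= ell - 2:
        raise ValueError("m must satisfy 1 <= m <= len(word) - 2")
    numerator = math.prod(count(seq, word[i:i + m + 1]) for i in range(ell - m))
    denominator = math.prod(count(seq, word[i:i + m]) for i in range(1, ell - m))
    # If an interior sub-word never occurs, the word cannot occur either.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    toy = "acgtacgtacgt"  # toy sequence, for illustration only
    print(count(toy, "acgt"), expected_count(toy, "acgt", 1), expected_count(toy, "acgt", 2))
```

For a 4-letter word, `m=1` uses the 2-letter and 1-letter counts and `m=2` uses the 3-letter and 2-letter counts, which is why the frequency of ctg is taken into account for gctg only from M2 onwards.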
                        Model M1                                 Model M2
w      N(w)   N_1(w)   sigma_1(w)    score   rank   N_2(w)   sigma_2(w)    score   rank
ctag     14    101.8        9.5      -9.21      2     28.7        4.7      -3.10     27
tagt     71    104.0        9.6      -3.42     57     47.3        5.8       4.07    246
ccat    218    191.1       12.6       2.12    180    168.6       10.0       4.94    253
gctg    406    255.2       14.3      10.52    254    394.6       11.9       0.96    170
ccgg    328    169.7       12.0      13.16    256    273.5       11.6       4.68    252

Table 6.1. Statistics of some 4-letter words in the Lambda genome under 
the models M1 and M2. The ranks of the scores are obtained by sorting 
the 256 scores in increasing order. 
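As a consistency check on Table 6.1, the scores can be recomputed from the other columns. The sketch below assumes that the Gaussian score is the standardized count (observed count minus expected count, divided by the standard deviation); this assumption reproduces the tabulated scores up to the rounding of the printed inputs. The value 11.9 for the M2 standard deviation of gctg is an assumed reading of a damaged table entry, and the names `TABLE_6_1` and `z_score` are hypothetical.

```python
# (word, observed count, expected M1, sd M1, expected M2, sd M2), read from Table 6.1.
TABLE_6_1 = [
    ("ctag", 14, 101.8, 9.5, 28.7, 4.7),
    ("tagt", 71, 104.0, 9.6, 47.3, 5.8),
    ("ccat", 218, 191.1, 12.6, 168.6, 10.0),
    ("gctg", 406, 255.2, 14.3, 394.6, 11.9),  # 11.9 is an assumed reading
    ("ccgg", 328, 169.7, 12.0, 273.5, 11.6),
]


def z_score(observed, expected, sd):
    """Standardized count, used here as the Gaussian-approximation score."""
    return (observed - expected) / sd


if __name__ == "__main__":
    for word, n_obs, e1, s1, e2, s2 in TABLE_6_1:
        print(f"{word}: M1 {z_score(n_obs, e1, s1):7.2f}   M2 {z_score(n_obs, e2, s2):7.2f}")
```

The recomputed values show the same pattern as the table: gctg drops from a score above 10 under M1 to below 1 under M2, whereas ccat rises from about 2 to almost 5.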


w      N(w)   N_1(w)   sigma_1(w)    score   rank
tag     217    481.2       17.6     -15.04      1
cat     803    869.4       21.6      -3.07     18
cca     675    706.5       19.9      -1.58     25
agt     595    590.2       19.1       0.25     34
gct     856    806.6       20.7       2.39     46
c?g     963    772.1       21.0       9.10     60
c?g     884    684.3       19.7      10.15     61
ctg    1169    802.4       20.8      17.63     63

Table 6.2. Statistics of some 3-letter words in the Lambda genome under 
model M1. The ranks of the scores are obtained by sorting the 64 scores 
in increasing order. 


Utility of models Mm_3 Coding DNA sequences are composed of successive trinucleotides called codons. Each base in the sequence is associated with a phase $k$ in $\{1,2,3\}$ depending on its position in the associated codon. In the general model Mm_3, the transition probabilities of a letter depend on its phase, and word occurrences can be analyzed separately for each phase or for all phases together (see p. 277); note that $N(w) = \sum_k N(w,k)$. Recall that the phase of an occurrence is by convention in this chapter the phase of its last letter. It is well-known to biologists that there exists a bias in the codon usage: codons that code for the same amino acid are not used uniformly. The following analysis illustrates the importance of taking the 3-letter word composition on each phase into account, in particular the codon composition (3-words on phase 3). Let us consider 36 genes of E. coli ($\ell = 44\,856$) and analyze the trinucleotide frequency. Figure 6.4 shows that the majority of the trinucleotides have the same behavior under M1 or M1_3; however, some trinucleotides are less exceptional when one takes the phase into account. If we now calculate the scores of the trinucleotides on phase 1, on phase 2 and on phase 3 under M1_3, we see that the main exceptional trinucleotides are the ones on phase 3: the codons. Figure 6.5 presents the discrepancy of these scores: codons are much more exceptional than the trinucleotides on phase 1 and 2. 
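The phase convention can be illustrated by counting the occurrences of a word separately for each phase. The following Python sketch is an illustration only: the function name `phase_counts` is hypothetical, and it assumes that the coding sequence starts in frame, so that the base at 1-based position $i$ has phase $((i-1) \bmod 3) + 1$. As in the text, the phase of an occurrence is the phase of its last letter.

```python
def phase_counts(coding_seq, word):
    """Count occurrences of word by phase (1, 2 or 3) of their last letter.

    Assumes the first base of coding_seq is the first base of a codon.
    """
    counts = {1: 0, 2: 0, 3: 0}
    for start in range(len(coding_seq) - len(word) + 1):
        if coding_seq.startswith(word, start):
            last = start + len(word)  # 1-based position of the last letter
            counts[(last - 1) % 3 + 1] += 1
    return counts


if __name__ == "__main__":
    toy = "atgatgatg"  # three copies of the codon atg, for illustration only
    print(phase_counts(toy, "atg"))
    print(phase_counts(toy, "tga"))
```

Summing the three values gives the total count of the word, in line with $N(w) = \sum_k N(w,k)$; a 3-letter word counted on phase 3 is a codon.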


Figure 6.4. Scores of the 3-letter words in 36 genes of E. coli under the 
models M1 (x-axis) and M1_3 (y-axis). 

Figure 6.5. Boxplots of the scores of the 3-letter words for each phase 
(phase 1, phase 2, phase 3) and for all phases in 36 genes of E. coli under the model M1_3. 


Figure 6.6 compares the scores of the 4-words on phase 1 under M1_3 (the codon composition is not taken into account) and M2_3 (the codon composition is taken into account). Note that a 4-word on phase 1 starts with a codon. The 3 most over-represented codons are ctg, cag and tgg. This over-representation is responsible for the exceptionality of ctgg, tggt, tggc and cagc. The over-representation of cagg seems to be a strong constraint since it is still exceptional given the high frequency of cag. When analyzing coding sequences, to be sure to find exceptional words that are not contaminated by the codon usage, the minimal model to use is the model M2_3. 


Figure 6.6. Scores of the 4-letter words on phase 1 in 36 genes of E. coli 
under the models M1_3 (x-axis) and M2_3 (y-axis). 


6.7.2. Sequencing by hybridization 


As a slightly more involved example of how statistics and probability on words are applied in DNA sequence analysis, we describe a problem related to sequencing by hybridization. Sequencing by hybridization is an approach to determine a DNA sequence from the unordered list of all $\ell$-tuples contained in this sequence; typical numbers for $\ell$ are $\ell = 8, 10, 12$. It is based on the fact that DNA nucleotides bind or hybridize with each other: a and t hybridize, and c and g hybridize. DNA strands have a polarity (5', 3'), and hybridizing sequences must be of opposite polarity. To avoid introducing notation to show polarity, we present complementary strands written in reverse direction. For example, the sequence tgtgtgagtg hybridizes with acacactcac. In a sequencing chip, all $4^\ell$ possible oligonucleotides ("probes") of length $\ell$ are attached to the surface of a substrate, each fragment at a distinct location. 

To use an SBH chip, the single-stranded target DNA is amplified, labeled with a fluorescent marker, and exposed to the sequencing chip. The probes on the chip will hybridize to a copy of the single-stranded target DNA if the substring complementary to the probe exists in the target. These probes are then detected with a spectroscopic detector. For example, if $\ell = 4$, the sequence tgtgtgagtg will hybridize to the probes acac, actc, caca, cact, ctca and tcac. 
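The probe list of this example can be recomputed mechanically. The sketch below is an illustration with hypothetical names (`COMPLEMENT`, `hybridizing_probes`). It follows the convention stated above that complementary strands are written in reverse direction, so a probe is obtained from a substring of the target by complementing each letter without reversing the order.

```python
# Letterwise complement: a <-> t, c <-> g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def hybridizing_probes(target, ell):
    """Return the sorted list of probes of length ell that hybridize to target.

    Each distinct ell-tuple of the target contributes its letterwise
    complement (complementary strand written in reverse direction).
    """
    tuples = {target[i:i + ell] for i in range(len(target) - ell + 1)}
    return sorted(t.translate(COMPLEMENT) for t in tuples)


if __name__ == "__main__":
    print(hybridizing_probes("tgtgtgagtg", 4))
```

Because the chip only reports which probes lit up, repeated $\ell$-tuples of the target are collapsed into a single probe here.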

As chips can be washed and used again, and due to automation, this method is not only fast but also inexpensive. There are still technical difficulties in producing an error-free chip; moreover the SBH image may be difficult to read. We remark that the microarray industry grew out of attempts to make SBH technology practical. However, even if these sources of errors are eliminated, a major drawback of the SBH procedure is that more than one sequence may produce the same SBH data. For example, if $\ell = 4$, the sequence acactcacac will hybridize to the same probes as the sequence acacactcac. 

To control this error resulting from non-unique recoverability, we are inter- 
ested in an estimate for the probability that a sequence is uniquely recoverable. 
This probability will depend on the probe length $\ell$, on the length $n$ of the target sequence, and on the frequencies of the different nucleotides, a, c, g and t, 
in the sequence. Furthermore we need to bound the error made in estimating 
the probability of unique recoverability in order to make assertions about the 
reliability of the chip. 

As a simplification, we assume that we not only know the set of all $\ell$-tuples in the sequence but also their multiplicity (but not the order in which they occur). This multiset is called the $\ell$-spectrum of the sequence. In the sequel, unique recoverability is understood to mean unique recoverability of a sequence from its $\ell$-spectrum. 

Unique recoverability from the $\ell$-spectrum can be characterized using the de Bruijn graph whose vertices are the $(\ell-1)$-tuples in the sequence. Two vertices $v$ and $w$ are joined by a directed edge from $v$ to $w$ if the $\ell$-spectrum contains an $\ell$-tuple for which the first $(\ell-1)$ nucleotides coincide with $v$ and the last $(\ell-1)$ nucleotides coincide with $w$. A sequence is uniquely recoverable from its spectrum if and only if there is a unique (Eulerian) path connecting all the vertices. It was shown that there are exactly three structures that prevent unique recoverability: 

1. Rotation. The sequence starts and ends with the same $(\ell-1)$-tuple. In this case, the de Bruijn graph is a cycle, and any vertex could be chosen as the starting point. 

2. Transposition with a three-way repeat. If an $(\ell-1)$-tuple occurs three times in the sequence, then the de Bruijn graph has two loops at this vertex, and the order in which these loops are passed is not fixed. 

3. Transposition with two interleaved pairs of repeats. There are two "interleaved" pairs of $(\ell-1)$-tuple repeats, i.e. in the de Bruijn graph there are two vertices $x$ and $y$ connected by a path of the form $\ldots x \ldots y \ldots x \ldots y \ldots$, where we described a path connecting all the vertices by listing the vertices in the order they are used in the path. This implies that there are two ways of going from $x$ to $y$ in the graph. 


EXAMPLE 6.7.1. The sequence acacactcac possesses as 4-spectrum the multiset {acac, acac, caca, cact, actc, ctca, tcac}. The competing sequence acactcacac has the same 4-spectrum. The de Bruijn graph for the sequence acacactcac has as vertices aca, cac, act, ctc and tca. There are two directed edges from aca to cac, and one directed edge each from cac to aca, from cac to act, from act to ctc, from ctc to tca, and from tca to cac. The competing sequence acactcacac has the same de Bruijn graph. For the sequence acacactcac, a path connecting all vertices is 

aca, cac, aca, (cac, act, ctc, tca), cac. 

The alternate path 

aca, (cac, act, ctc, tca), cac, aca, cac, 

also connecting all the vertices, corresponds to the sequence acactcacac, with the same 4-spectrum. 
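Example 6.7.1 can be checked by brute force. The following Python sketch is an illustration only: the function names `spectrum` and `reconstructions` are hypothetical, and the exhaustive search is meant for short sequences, not as a practical reconstruction algorithm. It enumerates every sequence whose $\ell$-spectrum equals a given multiset by extending a candidate one letter at a time, which amounts to enumerating the Eulerian paths of the de Bruijn graph.

```python
from collections import Counter


def spectrum(seq, ell):
    """The ell-spectrum of seq: the multiset of its ell-tuples."""
    return Counter(seq[i:i + ell] for i in range(len(seq) - ell + 1))


def reconstructions(spec, alphabet="acgt"):
    """Return the set of all sequences whose spectrum equals spec."""
    ell = len(next(iter(spec)))
    total = sum(spec.values())
    results = set()

    def extend(seq, remaining, used):
        if used == total:
            results.add(seq)
            return
        suffix = seq[-(ell - 1):]
        for letter in alphabet:
            tup = suffix + letter
            if remaining[tup] > 0:  # this ell-tuple is still available
                remaining[tup] -= 1
                extend(seq + letter, remaining, used + 1)
                remaining[tup] += 1

    for start in list(spec):
        remaining = Counter(spec)
        remaining[start] -= 1
        extend(start, remaining, 1)
    return results


if __name__ == "__main__":
    spec = spectrum("acacactcac", 4)
    print(sorted(reconstructions(spec)))
```

For the 4-spectrum of acacactcac the search finds exactly the two sequences of the example, so this sequence is not uniquely recoverable.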


Thus unique recoverability can be described in terms of possibly overlapping repeats of $(\ell-1)$-tuples within a single sequence. We use the model M0. For a sequence to be uniquely recoverable, the event of an $(\ell-1)$-tuple repeat should be rare. This implies that we consider the occurrence of $(\ell-1)$-tuples under a Poisson regime. (Note that we are interested in the configuration in which the repeats occur; hence we need a Poisson process approximation for the process of repeats rather than a Poisson approximation for the number of repeats.) If repeats are rare, then three-way repeats are negligible, and so is the probability that a sequence starts and ends with the same $(\ell-1)$-tuple. After bounding these probabilities, we thus restrict our attention to interleaved pairs of repeats. Under the Poisson regime, if there are $k$ pairs of repeats, then the occurrences of these repeats are discrete uniform. Additional randomization makes the position of the repeats continuously uniform, so that all orderings of these pairs will be approximately equally likely. This allows the application of a combinatorial argument using Catalan numbers to obtain that, if $k$ repeats are present, the probability that no two of them are interleaved is approximately $2^k/(k+1)!$. If $\lambda$ is the expected number of repeats of $\ell$-tuples in a single sequence, we hence get, for the probability $P$ that $X_1 X_2 \cdots X_n$ is uniquely recoverable from its $\ell$-spectrum,

$$P \approx e^{-\lambda} \sum_{k \ge 0} \frac{(2\lambda)^k}{k!\,(k+1)!}.$$
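The approximation for the probability of unique recoverability is easy to evaluate numerically. The Python sketch below assumes that the display reads $e^{-\lambda} \sum_{k \ge 0} (2\lambda)^k / (k!\,(k+1)!)$, that is, a Poisson($\lambda$) number $k$ of repeats weighted by the factor $2^k/(k+1)!$ from the Catalan-number argument. The function name and the truncation parameter `terms` are hypothetical.

```python
import math


def unique_recovery_probability(lam, terms=60):
    """Evaluate exp(-lam) * sum_{k>=0} (2*lam)^k / (k! (k+1)!), truncated after `terms` terms."""
    total = 0.0
    term = 1.0  # value of (2*lam)^k / (k! (k+1)!) at k = 0
    for k in range(terms):
        total += term
        # Ratio of consecutive terms: 2*lam / ((k+1) * (k+2)).
        term *= 2.0 * lam / ((k + 1) * (k + 2))
    return math.exp(-lam) * total


if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0, 2.0):
        print(lam, round(unique_recovery_probability(lam), 4))
```

The value is 1 when no repeats are expected and decreases as $\lambda$ grows, which matches the intuition that more repeats make interleaved pairs more likely.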


The Chen-Stein method for Poisson approximation provides explicit bounds for the error terms in this approximation, as follows. 

In the sequence $X_1 \cdots X_n$ of independent identically distributed letters, let $p = \sum_{a \in \mathcal{A}} \mu^2(a)$ be the probability that two random letters match. We write $t$ for $\ell - 1$, as we are interested in $(\ell-1)$-repeats. Again we have to declump: Define $Y_{i,i} = 0$ for all $i$, and

$$Y_{i,j} = \begin{cases} \mathbb{1}\{X_{i+1} \cdots X_{i+t} = X_{j+1} \cdots X_{j+t}\} & \text{if } i = 0, \\ (1 - \mathbb{1}\{X_i = X_j\})\, \mathbb{1}\{X_{i+1} \cdots X_{i+t} = X_{j+1} \cdots X_{j+t}\} & \text{otherwise.} \end{cases}$$

Thus $Y_{i,j} = 1$ if and only if there is a leftmost repeat starting after $i$ and $j$. Put $I = \{(i,j),\ 1 \le i, j \le n - \ell + 1\}$. A careful analysis yields that the process $Y = (Y_\alpha)_{\alpha \in I}$ is sufficient to decide whether a sequence is uniquely recoverable from its $\ell$-spectrum (although $Y$ contains strictly less information than the process of indicators of occurrences). 

For a Poisson process approximation, we first identify the expected number $\lambda$ of leftmost repeats. If $\alpha = (i,j)$ does not have self-overlap, that is, if $j - i > t$, then

$$E(Y_\alpha) = \begin{cases} p^t & \text{if } i = 0, \\ (1-p)p^t & \text{otherwise.} \end{cases}$$

Hence the expected number $\lambda^*$ of repeats without self-overlap is

$$\lambda^* = \binom{n-2t}{2}(1-p)p^t + (n-2t)p^t.$$
If $\alpha$ does have self-overlap, then, in order to have a leftmost repeat at $\alpha$, for indices in the overlapping set, two matches are required, and for indices in the non-overlapping set, one match is required. Let $d = j - i$; then $E(Y_\alpha)$ depends on the decomposition of $t + d$ into a quotient $q$ of $d$ and a remainder $r$ (such that $t + d = qd + r$): if $p_q$ is the probability that $q$ random letters match, then 


Tr d—-r Bi Ae 
E(Y,) = ame i 


r,jd—r 


(Pq — Pg+1) pq | otherwise. 


If $\lambda^*$ is bounded away from 0 and infinity, which corresponds to having $t = 2\log_{1/p}(n) + c$ for some constant $c$, then it can be seen that

$$\lambda \sim \lambda^* \sim \frac{n^2}{2}(1-p)p^t.$$
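To get a feeling for the regime $t = 2\log_{1/p}(n) + c$, the expected number of repeats without self-overlap can be computed directly. The sketch below is an illustration under stated assumptions: it takes $\lambda^* = \binom{n-2t}{2}(1-p)p^t + (n-2t)p^t$ and the asymptotic form $\frac{n^2}{2}(1-p)p^t$ as read from the displays above, and the function names and the numerical values of $n$ and $t$ are hypothetical.

```python
import math


def match_probability(mu):
    """Probability p that two independent random letters match, for letter distribution mu."""
    return sum(x * x for x in mu.values())


def lambda_star(n, t, p):
    """Expected number of leftmost repeats of t-tuples without self-overlap."""
    return math.comb(n - 2 * t, 2) * (1 - p) * p**t + (n - 2 * t) * p**t


def lambda_asymptotic(n, t, p):
    """Leading-order approximation n^2 / 2 * (1 - p) * p^t."""
    return n * n / 2 * (1 - p) * p**t


if __name__ == "__main__":
    uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
    p = match_probability(uniform)
    n, t = 10_000, 13  # t is close to 2 * log_4(n), about 13.3
    print(p, lambda_star(n, t, p), lambda_asymptotic(n, t, p))
```

For uniform letters and these values the two quantities agree to within about one percent, and both are of order one, which is the regime in which the Poisson approximation is informative.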


Under the regime that $\lambda$ is bounded away from 0 and infinity, here is a general result. Let $\mu_{\max} = \max_a \mu(a)$ be the probability of the most likely letter. 


THEOREM 6.7.2. Let $Z = (Z_\alpha)_{\alpha \in I}$ be a process with independent Poisson distributed coordinates $Z_\alpha$, with $E(Z_\alpha) = E(Y_\alpha)$, $\alpha \in I$. Then

$$d_{TV}(Y, Z) \le b(n,t),$$

where the error term $b(n,t)$ is such that 
Qt: F ; 
Noni x { 16. ;, in the uniform case 


to: : : 
Nbmax in the nonuniform case. 


6.8. Some probabilistic and statistical tools 


6.8.1. Stein’s method for normal approximation 


Stein’s method for the normal approximation makes it possible to obtain mul- 
tivariate normal approximations with a bound on the error in the distance of 
suprema over convex sets, as follows. 
Let $\mathcal{H}$ denote the class of convex sets in $\mathbb{R}^d$. Let $Y_j$, $j = 1, \ldots, n$ be random vectors taking values in $\mathbb{R}^d$, and let $W = \sum_{j=1}^{n} Y_j$ be the vector of sums. Assume there is a constant $B$ such that $|Y_j| := \sum_{i=1}^{d} |Y_{ji}| \le B$. Let $Z \sim N(0, I_d)$ have the $d$-dimensional standard multivariate normal distribution. 


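THEOREM 6.8.1. Let $S_i$ and $N_i$ be subsets of $\{1, \ldots, n\}$, such that $i \in S_i \subset N_i$, $i = 1, \ldots, n$. Assume that there exist constants $D_1 \le D_2$ such that

$$\max\{\mathrm{Card}(S_i),\ i = 1, \ldots, n\} \le D_1$$

and

$$\max\{\mathrm{Card}(N_i),\ i = 1, \ldots, n\} \le D_2.$$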


Then, for d = 1 there exists a universal constant c such that 


sup |P(W < x) —P(Z<2)| 
zeER 


< c{2D.B + n(2 + (E(W2))D,D2B? + x1 + x2 + x3}- 
For d > 1 there exists a constant c depending only on the dimension d such that 


sup [P(W € A) — P(Z € A)| < c{2VdD2B + 2VdnD,D>B*(|log B| + logn) 
AEH 


+x1 + (| log B| + logn)(x2 + x3)}, 


where 
xX1= SCE E(Y,| S- Yx) 
j=l k€S; 
x2 = ELEY) YW") -E¥,(S> YW)" DO YD) 
j=1 keS; keS; lEN; 
xs = |I— DU B(Y)(D/ Ye)")). 
j=l keS; 


Note that there are no explicit assumptions on the mean vector and the 
variance-covariance matrix; however, for a good approximation it would be de- 
sirable to have the mean vector close to zero, and the variance-covariance matrix 
close to the identity. 


6.8.2. The Chen-Stein method for Poisson approximation 


The Chen-Stein method is a powerful tool for deriving Poisson approximations and compound Poisson approximations in terms of bounds on the total variation distance. For any two random processes $Y$ and $Z$ with values in the same space $E$, the total variation distance between their probability distributions is defined by

$$d_{TV}(\mathcal{L}(Y), \mathcal{L}(Z)) = \sup_{B \subset E,\ \text{measurable}} |P(Y \in B) - P(Z \in B)| = \sup_{h : E \to [0,1],\ \text{measurable}} |E(h(Y)) - E(h(Z))|.$$

The following general bound on the distance to a Poisson distribution is avail- 
able. 


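THEOREM 6.8.2. Let $I$ be an index set. For each $\alpha \in I$, let $Y_\alpha$ be a Bernoulli random variable with $p_\alpha = P(Y_\alpha = 1) > 0$. Suppose that, for each $\alpha \in I$, we have chosen $B_\alpha \subset I$ with $\alpha \in B_\alpha$. Let $Z_\alpha$, $\alpha \in I$, be independent Poisson variables with mean $p_\alpha$. The total variation distance between the dependent Bernoulli process $Y = (Y_\alpha, \alpha \in I)$ and the Poisson process $Z = (Z_\alpha, \alpha \in I)$ satisfies

$$d_{TV}(\mathcal{L}(Y), \mathcal{L}(Z)) \le b_1 + b_2 + b_3,$$

where

$$b_1 = \sum_{\alpha \in I} \sum_{\beta \in B_\alpha} p_\alpha p_\beta, \qquad (6.8.1)$$

$$b_2 = \sum_{\alpha \in I} \sum_{\beta \in B_\alpha,\ \beta \ne \alpha} E(Y_\alpha Y_\beta), \qquad (6.8.2)$$

$$b_3 = \sum_{\alpha \in I} E\big| E\{Y_\alpha - p_\alpha \mid \sigma(Y_\beta, \beta \notin B_\alpha)\} \big|. \qquad (6.8.3)$$

Moreover, if $W = \sum_{\alpha \in I} Y_\alpha$ and $\lambda = \sum_{\alpha \in I} p_\alpha < \infty$, then

$$d_{TV}(\mathcal{L}(W), \mathrm{Po}(\lambda)) \le \frac{1 - e^{-\lambda}}{\lambda}(b_1 + b_2) + \min\Big(1, \sqrt{\frac{2}{\lambda e}}\Big) b_3.$$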

Note that $b_3 = 0$ if $Y_\alpha$ is independent of $\sigma(Y_\beta, \beta \notin B_\alpha)$. We think of $B_\alpha$ as 
a neighborhood of strong dependence of $Y_\alpha$. 
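The simplest case of the theorem can be checked numerically. The Python sketch below is an illustration under stated assumptions: it takes $n$ independent Bernoulli($p$) variables with neighborhoods $B_\alpha = \{\alpha\}$, so that $b_1 = np^2$ and $b_2 = b_3 = 0$, and it uses the factor $(1 - e^{-\lambda})/\lambda$ for the bound on the sum $W$. The exact total variation distance between the Binomial($n$, $p$) law of $W$ and the Poisson($np$) law is computed as half the $L^1$ distance of the probability mass functions. All function names are hypothetical.

```python
import math


def binomial_pmf(n, p):
    """Probability mass function of Binomial(n, p) on 0..n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def poisson_pmf(lam, kmax):
    """Probability mass function of Poisson(lam) on 0..kmax, computed iteratively."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def total_variation(n, p):
    """Exact total variation distance between Binomial(n, p) and Poisson(n * p)."""
    binom = binomial_pmf(n, p)
    pois = poisson_pmf(n * p, n)
    tail = max(0.0, 1.0 - sum(pois))  # Poisson mass beyond n, where the binomial has none
    return 0.5 * (sum(abs(x - y) for x, y in zip(binom, pois)) + tail)


def chen_stein_bound(n, p):
    """Bound (1 - exp(-lam)) / lam * b1 with b1 = n * p^2 and b2 = b3 = 0."""
    lam = n * p
    return (1 - math.exp(-lam)) / lam * (n * p * p)


if __name__ == "__main__":
    print(total_variation(50, 0.05), chen_stein_bound(50, 0.05))
```

In this independent setting the bound reduces to $(1 - e^{-np})\,p$, and the exact distance stays below it; for dependent indicators such as word occurrences, the terms $b_2$ and $b_3$ capture the cost of the dependence inside and outside the neighborhoods.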

One consequence of this theorem is that for any indicator of an event, i.e.
for any measurable functional $h$ from $E$ to $[0,1]$, there is an error bound of the
form $|E(h(Y)) - E(h(Z))| \le d_{TV}(\mathcal{L}(Y), \mathcal{L}(Z))$. Thus, if $T(Y)$ is a test statistic
then, for all $t \in \mathbb{R}$,

\[
|P(T(Y) \ge t) - P(T(Z) \ge t)| \le b_1 + b_2 + b_3,
\]


which can be used to construct confidence intervals and to find p-values for tests 
based on this statistic. 

Note that this method can also be used to prove compound Poisson ap- 
proximations. For multivariate compound Poisson approximations it is very 
convenient. For univariate compound Poisson approximations, better bounds 
are at hand, as will be illustrated in the next subsection. 
6.8.3. Stein’s method for direct compound Poisson approximation


A drawback of the point process approach to compound Poisson approximation 
is that the bounds are not very accurate. Instead it is possible to set up a 
related method for obtaining a compound Poisson approximation directly, in 
the univariate case. 

Denote by $\mathrm{CP}(\lambda, \nu)$ the compound Poisson distribution with parameters
$\lambda$ and $\nu$, that is, the distribution of the random variable $\sum_{k \ge 1} k M_k$, where
$(M_k)_{k \ge 1}$ are independent, and $M_k \sim \mathrm{Po}(\lambda \nu_k)$, $k = 1, 2, \ldots$.

The particular case where $\lambda = n\theta(1 - p)$ and $\nu_k = p^{k-1}(1 - p)$ for some
$\theta > 0$ and $0 < p < 1$ is called the Pólya-Aeppli distribution.
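
As an illustration of the definition, the point probabilities of $\mathrm{CP}(\lambda, \nu)$ can be computed with the classical recursion for compound Poisson laws, $P(W = 0) = e^{-\lambda}$ and $P(W = n) = (\lambda / n) \sum_{k=1}^{n} k \nu_k P(W = n - k)$. The sketch below assumes that $\nu$ is a probability distribution on $\{1, 2, \ldots\}$; the function name and the dictionary representation of $\nu$ are hypothetical choices, not taken from the chapter.

```python
import math


def compound_poisson_pmf(lam, nu, n_max):
    """Return [P(W = 0), ..., P(W = n_max)] for W = sum_k k * M_k,
    with independent M_k ~ Po(lam * nu[k]).

    `nu` maps a clump size k >= 1 to its probability nu_k and is
    assumed to sum to one.
    """
    pmf = [math.exp(-lam)]  # W = 0 exactly when every M_k is zero
    for n in range(1, n_max + 1):
        total = sum(k * nu.get(k, 0.0) * pmf[n - k] for k in range(1, n + 1))
        pmf.append(lam * total / n)
    return pmf


# Polya-Aeppli example with nu_k = p^(k-1) (1 - p), truncated at n_max.
p, lam, n_max = 0.3, 2.0, 40
nu = {k: p ** (k - 1) * (1 - p) for k in range(1, n_max + 1)}
pmf = compound_poisson_pmf(lam, nu, n_max)
print(pmf[:4], sum(pmf))
```

With $\nu_1 = 1$ the recursion reduces to the ordinary Poisson distribution, which gives a quick sanity check.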

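Again, let $I$ be an index set, and let

\[
W = \sum_{\alpha \in I} V_\alpha,
\]

where $(V_\alpha)_{\alpha \in I}$ are nonnegative integer-valued random variables. Similarly to
the Poisson case, for each $\alpha \in I$ decompose the index set into disjoint sets as

\[
I = \{\alpha\} \cup S_\alpha \cup W_\alpha \cup U_\alpha.
\]

Here, $S_\alpha$ would correspond to the set of indices with strong influence on $\alpha$, $W_\alpha$
would correspond to the set of indices with weak influence on $\alpha$, and $U_\alpha$ collects
the remaining indices. Put (using the same letters for the index sets and for the corresponding sums)

\[
S_\alpha = \sum_{\beta \in S_\alpha} V_\beta, \qquad
W_\alpha = \sum_{\beta \in W_\alpha} V_\beta, \qquad
U_\alpha = \sum_{\beta \in U_\alpha} V_\beta.
\]

Then, for $\alpha \in I$,
\[
W = V_\alpha + S_\alpha + W_\alpha + U_\alpha.
\]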


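Define the canonical parameters $(\lambda, \nu)$ of the corresponding compound Poisson
distribution by

\[
\lambda \nu_k = \frac{1}{k} \sum_{\alpha \in I} E\{ V_\alpha \mathbb{I}(V_\alpha + S_\alpha = k) \}, \quad k \ge 1,
\]
\[
\lambda = \sum_{k \ge 1} \lambda \nu_k. \qquad (6.8.4)
\]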


Put
\[
\mu_{1,\alpha} = E(V_\alpha)
\]
and
\[
m_1 = E(W) = \sum_{\alpha \in I} \mu_{1,\alpha}.
\]


Similarly to the Poisson case, we shall need the quantities 


= (a) P(Va = j, Sa = k|We) 
i= Doma D aw | pea eee 
acl j2ik>1 


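\[
\delta_2 = 2 \sum_{\alpha \in I} E\{ V_\alpha \, d_{TV}(\mathcal{L}(W_\alpha \mid V_\alpha, S_\alpha), \mathcal{L}(W_\alpha)) \}
\]
\[
\delta_3 = \sum_{\alpha \in I} \{ E(V_\alpha U_\alpha) + E(V_\alpha) E(V_\alpha + S_\alpha + U_\alpha) \}.
\]

Then, roughly, $\delta_3$ corresponds to $b_1 + b_2$ in the Poisson case, whereas $\delta_1$ and $\delta_2$
correspond to $b_3$ in the Poisson case.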
The following result can be shown to hold. 


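THEOREM 6.8.3. There exist constants $H_0 = H_0(\lambda, \nu)$, $H_1 = H_1(\lambda, \nu)$, independent
of $W$, such that, with $(\lambda, \nu)$ given in (6.8.4),

\[
d_{TV}(\mathcal{L}(W), \mathrm{CP}(\lambda, \nu)) \le H_0 \min(\delta_1, \delta_2) + H_1 \delta_3,
\]
and
\[
H_0, H_1 \le \min(1, (\lambda \nu_1)^{-1}) e^{\lambda}.
\]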
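If in addition
\[
k \nu_k \ge (k + 1) \nu_{k+1}, \quad k \ge 1, \qquad (6.8.5)
\]
then, with $\gamma = \lambda(\nu_1 - 2\nu_2)$,
\[
H_0 \le \min\Bigl\{ 1, \frac{1}{\sqrt{\gamma}} \Bigl( 2 - \frac{1}{\sqrt{\gamma}} \Bigr) \Bigr\}, \qquad
H_1 \le \min\Bigl\{ 1, \frac{1}{\gamma} \Bigl( \frac{1}{4\gamma} + \log^{+}(2\gamma) \Bigr) \Bigr\}.
\]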


An important special case is the declumped situation, that is, $W$ can be
written as
\[
W = \sum_{\alpha \in I} \sum_{k \ge 1} k I_{\alpha k},
\]
where
\[
I_{\alpha k} = \mathbb{I}(\alpha \text{ is the index of the representative of a clump of size } k).
\]
For $\alpha \in I$, $k \in \mathbb{N}$, let $B(\alpha, k) \subset I \times \mathbb{N}$ contain $\{\alpha\} \times \mathbb{N}$; this set can be viewed
intuitively as the neighborhood of strong dependence of $(\alpha, k)$.
The canonical parameters are now
\[
\lambda = \sum_{\alpha \in I} \sum_{k \ge 1} E(I_{\alpha k}), \qquad
\nu_k = \lambda^{-1} \sum_{\alpha \in I} E(I_{\alpha k}). \qquad (6.8.6)
\]


ael 


For example, if $I_{\alpha k} = \tilde{Y}_{i,k}$, then $W = N(w)$, and the canonical parameters
are $(n - \ell + 1)\tilde{\mu}_k$, $k \ge 1$, and $\lambda = E(N(w))$, so that the approximating distribution
is as before, $\mathcal{L}(\sum_{k \ge 1} k Z_k)$ with $Z_k$'s independent Poisson variables with
parameters $(n - \ell + 1)\tilde{\mu}_k$. Thus it is the same distribution as in Corollary 6.4.8.

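As in the Poisson case, we shall need the quantities

\[
b_1^* = \sum_{(\alpha, k) \in I \times \mathbb{N}} \ \sum_{(\beta, k') \in B(\alpha, k)} k k' E(I_{\alpha k}) E(I_{\beta k'})
\]
\[
b_2^* = \sum_{(\alpha, k) \in I \times \mathbb{N}} \ \sum_{(\beta, k') \in B(\alpha, k) \setminus \{(\alpha, k)\}} k k' E(I_{\alpha k} I_{\beta k'})
\]
\[
b_3^* = \sum_{(\alpha, k) \in I \times \mathbb{N}} k \, E \bigl| E\{ I_{\alpha k} - E(I_{\alpha k}) \mid \sigma(I_{\beta k'}, (\beta, k') \notin B(\alpha, k)) \} \bigr|.
\]

The following result holds.


THEOREM 6.8.4. With the parameters as in (6.8.6), we have that

\[
d_{TV}(\mathcal{L}(W), \mathrm{CP}(\lambda, \nu)) \le b_3^* H_0 + (b_1^* + b_2^*) H_1.
\]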


6.8.4. Moment-generating function 


Here is a short outline of moment-generating functions. The moment-generating
function $\Phi_X$ of a random variable $X$ is defined as

\[
\Phi_X(t) = E(e^{tX}).
\]

So, if $X$ has a discrete distribution $p$, we have that
\[
\Phi_X(t) = \sum_{x} e^{tx} p(x).
\]

If the moment-generating function exists for all $t$ in an open interval containing
zero, it uniquely determines the probability distribution.

In particular, under regularity conditions the moments of a random variable
can be obtained via the moment-generating function using differentiation.
Namely, if $\Phi_X(t)$ is finite, we have
\[
\Phi_X'(t) = \frac{d}{dt} E(e^{tX}) = E(X e^{tX}).
\]
Thus
\[
\Phi_X'(0) = E(X)
\]
if both sides of the equation exist. Similarly, differentiating $r$ times we obtain
\[
\Phi_X^{(r)}(0) = E(X^r).
\]


A special case is when the moment-generating function $\Phi_X(t)$ is rational,
that is, when $\Phi_X(t)$ can be written as

\[
\Phi_X(t) = \frac{p_0 + p_1 t + \cdots + p_r t^r}{q_0 + q_1 t + \cdots + q_s t^s} = \sum_{d \ge 0} f(d) t^d,
\]
for some $r, s$ and coefficients $p_0, \ldots, p_r, q_0, \ldots, q_s$. By normalization we may
assume $q_0 = 1$. Then
\[
p_0 + p_1 t + \cdots + p_r t^r = \sum_{d} f(d) t^d (1 + q_1 t + \cdots + q_s t^s).
\]
Identification of the coefficients of $t^i$ on both sides yields
\[
p_i = \sum_{d=0}^{i} f(d) q_{i-d} \quad \text{for } i \le r,
\]
\[
0 = \sum_{d=0}^{i} f(d) q_{i-d} \quad \text{for } i > r.
\]

This gives a recurrence formula for the coefficients $f(d)$; we have

\[
f(0) = p_0,
\]
\[
f(d) = p_d - \sum_{u=1}^{\min(d,s)} f(d - u) q_u, \quad d \ge 1,
\]
where $p_d = 0$ for $d > r$.
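
The recurrence above translates directly into a few lines of code. The following sketch expands a rational function given by its numerator and denominator coefficients into its first power-series coefficients $f(0), f(1), \ldots$; the function name `series_coefficients` is a hypothetical choice, and the normalization $q_0 = 1$ is performed inside the function.

```python
def series_coefficients(p, q, n_terms):
    """First n_terms coefficients f(d) of (sum_i p[i] t^i) / (sum_j q[j] t^j).

    Requires q[0] != 0; both polynomials are divided by q[0] so that the
    normalized denominator has constant term 1, as assumed in the text.
    """
    q0 = q[0]
    p = [c / q0 for c in p]
    q = [c / q0 for c in q]
    s = len(q) - 1
    f = []
    for d in range(n_terms):
        p_d = p[d] if d < len(p) else 0.0  # p_d = 0 for d > r
        f.append(p_d - sum(f[d - u] * q[u] for u in range(1, min(d, s) + 1)))
    return f


# t / (1 - t - t^2) generates the Fibonacci numbers.
print(series_coefficients([0, 1], [1, -1, -1], 8))
# (t/2) / (1 - t/2) has coefficients (1/2)^d for d >= 1.
print(series_coefficients([0, 0.5], [1, -0.5], 5))
```

Each coefficient costs at most $s$ multiplications, so the first $n$ coefficients are obtained in $O(ns)$ operations.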


6.8.5. The δ-method


In general, the δ-method, or propagation of error, is a linear approximation
(Taylor expansion) of a nonlinear function of random variables. Here we are
particularly interested in the validity of a normal approximation for functions
of random vectors.


THEOREM 6.8.5. Let $X_n = (X_{n1}, X_{n2}, \ldots, X_{nk})$ be a sequence of random vectors
satisfying

\[
b_n (X_n - \mu) \xrightarrow{\mathcal{D}} N(0, \Sigma)
\]




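with $b_n \to \infty$. The vector valued function $g(x) = (g_1(x), \ldots, g_\ell(x))$ has real
valued $g_i(x)$ with non-zero differential

\[
\frac{\partial g_i}{\partial x} = \Bigl( \frac{\partial g_i}{\partial x_1}, \ldots, \frac{\partial g_i}{\partial x_k} \Bigr).
\]

Define $D = (d_{i,j})$ where $d_{i,j} = \frac{\partial g_i}{\partial x_j}(\mu)$. Then

\[
b_n (g(X_n) - g(\mu)) \xrightarrow{\mathcal{D}} N(0, D \Sigma D^T).
\]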
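
As a small numerical illustration of the limiting covariance $D \Sigma D^T$, the sketch below approximates the matrix $D$ of partial derivatives at $\mu$ by central differences and then forms the product. The function names `jacobian` and `delta_method_cov`, the step size `h`, and the example function $g(x_1, x_2) = x_1 / x_2$ are assumptions chosen for illustration only.

```python
def jacobian(g, mu, h=1e-6):
    """Central-difference approximation of D = (dg_i/dx_j) at the point mu."""
    base = g(mu)
    D = [[0.0] * len(mu) for _ in base]
    for j in range(len(mu)):
        up = list(mu)
        dn = list(mu)
        up[j] += h
        dn[j] -= h
        g_up, g_dn = g(up), g(dn)
        for i in range(len(base)):
            D[i][j] = (g_up[i] - g_dn[i]) / (2 * h)
    return D


def delta_method_cov(g, mu, sigma):
    """Asymptotic covariance D Sigma D^T of b_n (g(X_n) - g(mu))."""
    D = jacobian(g, mu)
    k, m = len(mu), len(D)
    # DS = D * Sigma, an m x k matrix.
    DS = [[sum(D[i][a] * sigma[a][b] for a in range(k)) for b in range(k)]
          for i in range(m)]
    # (D Sigma) * D^T, an m x m matrix.
    return [[sum(DS[i][b] * D[j][b] for b in range(k)) for j in range(m)]
            for i in range(m)]


# Ratio of two coordinates: g(x1, x2) = x1 / x2 at mu = (1, 2), Sigma = identity.
ratio = lambda x: [x[0] / x[1]]
print(delta_method_cov(ratio, [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]]))
```

For this example the exact gradient at $\mu = (1, 2)$ is $(1/2, -1/4)$, so the limiting variance is $1/4 + 1/16$.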


6.8.6. A large deviation principle 


Assume $X_1, \ldots, X_n$ is an irreducible Markov chain on a finite alphabet $\mathcal{A}$ with
transition probabilities $\pi(a, b)$, $a, b \in \mathcal{A}$. Large deviations from the mean can
be described as follows.


THEOREM 6.8.6 (Miller). Let $f$ be a function mapping $\mathcal{A}$ into $\mathbb{R}$. Then
$n^{-1} \sum_{i=1}^{n} f(X_i)$ obeys a large deviation principle with rate function $I$ defined
below: for every closed subset $F \subset \mathbb{R}$ and every open subset $O \subset \mathbb{R}$,

\[
\limsup_{n \to \infty} \frac{1}{n} \log P\Bigl( \frac{1}{n} \sum_{i=1}^{n} f(X_i) \in F \Bigr) \le - \inf_{x \in F} I(x),
\]
\[
\liminf_{n \to \infty} \frac{1}{n} \log P\Bigl( \frac{1}{n} \sum_{i=1}^{n} f(X_i) \in O \Bigr) \ge - \inf_{x \in O} I(x).
\]
The rate function $I$ is positive, convex, uniquely equal to zero at $x = E(f(X_1))$
and given by

\[
I(x) = \sup_{\theta} (\theta x - \log \lambda(\theta)),
\]

where $\lambda(\theta)$ is the largest eigenvalue of the matrix $(e^{\theta f(b)} \pi(a, b))_{a, b \in \mathcal{A}}$.
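
The rate function can be evaluated numerically for small alphabets. The sketch below is only an illustration under stated assumptions: the largest eigenvalue is obtained by power iteration (which presumes a nonnegative, irreducible and aperiodic matrix), and the supremum over $\theta$ is located by ternary search on a bounded interval, which relies on the concavity of $\theta \mapsto \theta x - \log \lambda(\theta)$. The function names and the search interval are hypothetical choices.

```python
import math


def largest_eigenvalue(matrix, iterations=200):
    """Power iteration for the dominant eigenvalue of a nonnegative matrix.

    Assumes the matrix is irreducible and aperiodic so the iteration converges.
    """
    n = len(matrix)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(n)) for a in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam


def rate_function(x, pi, f, lo=-20.0, hi=20.0, steps=100):
    """I(x) = sup_theta (theta * x - log lambda(theta)), searched on [lo, hi]."""
    size = len(f)

    def objective(theta):
        tilted = [[math.exp(theta * f[b]) * pi[a][b] for b in range(size)]
                  for a in range(size)]
        return theta * x - math.log(largest_eigenvalue(tilted))

    # Ternary search for the maximum of a concave function of theta.
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2)


# Independent fair coin viewed as a Markov chain; f counts the letter 1.
pi = [[0.5, 0.5], [0.5, 0.5]]
f = [0.0, 1.0]
print(rate_function(0.5, pi, f), rate_function(0.7, pi, f))
```

In this independent special case the rate function coincides with the relative entropy $x \log(2x) + (1 - x) \log(2(1 - x))$, which vanishes at the mean $x = 1/2$.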


6.8.7. A CLT for martingales 


For martingales, the following central limit theorem is available. 


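THEOREM 6.8.7. Let $(\xi_{n,i})_{i = 1, \ldots, n}$ be a triangular array of $d$-dimensional random
vectors such that $E\|\xi_{n,i}\|^2 < \infty$, and let $V$ be a positive $d \times d$ matrix. Put
$\mathcal{F}_{n,i} = \sigma(\xi_{n,1}, \ldots, \xi_{n,i})$; $E(\xi_{n,i} \mid \mathcal{F}_{n,i-1})$ denotes the conditional expectation vector
of $\xi_{n,i}$ and $\mathrm{Cov}(\xi_{n,i} \mid \mathcal{F}_{n,i-1})$ denotes the conditional covariance matrix of
$\xi_{n,i}$. If, as $n \to \infty$,

(i) $\sum_{i=1}^{n} E(\xi_{n,i} \mid \mathcal{F}_{n,i-1}) \to 0$,

(ii) $\sum_{i=1}^{n} \mathrm{Cov}(\xi_{n,i} \mid \mathcal{F}_{n,i-1}) \to V$,

(iii) $\forall \varepsilon > 0$, $\sum_{i=1}^{n} E(\|\xi_{n,i}\|^2 \mathbb{I}\{\|\xi_{n,i}\| > \varepsilon\} \mid \mathcal{F}_{n,i-1}) \to 0$,

then $\sum_{i=1}^{n} \xi_{n,i} \xrightarrow{\mathcal{D}} N(0, V)$.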


Notes 


The material in this chapter can be viewed as an updated version of Reinert 
et al. (2000). Recent progress on exact distributional results, as well as on 
compound Poisson approximation and on multivariate normal approximation, 
is included. 

This chapter does not deal with the algorithmic issues; an excellent starting 
point would be Waterman (1995) or Gusfield (1997). For a particular example 
see also Apostolico et al. (1998), and for a recent exposition see Lonardi (2001). 


Number of clumps. Equations (6.2.6) and (6.2.9) that characterize the occur- 
rence of a clump, or a k-clump, of the word w at a given position with respect 
to the periods of w are due to Schbath (1995a). 


Word locations. The recursive formula for the exact distribution of the distance 
D between two word occurrences (Theorem 6.3.1) is from Robin and Daudin 
(1999). It was first proposed for independent and uniformly distributed letters 
by Blom and Thorburn (1982). The moment-generating function of the distance 
D, expressed as a rational function and given in Theorem 6.3.2, also comes from 
Robin and Daudin (1999). Recently, Stefanov (2003) obtained another expres- 
sion for the generating function that avoids the calculation of the “infinite” sum 
of the II"’s. 

Similar results are derived in Robin and Daudin (2001) and Stefanov (2003) 
for the probability distribution of the distance between any word in a given 
set. They are not presented in Section 6.6 but are useful for instance for the 
purpose of calculating the occurrence probability of a structured motif (Robin 
et al. (2002)). These motifs are of particular interest since they are candidate 
promoters for transcription. Indeed, a structured motif is of the form $v(d_1 : d_2)w$,
denoting a word $v$ separated from a word $w$ by a distance between $d_1$ and
$d_2$, where $v$ and $w$ can be approximate patterns. Efficient algorithms exist to
find such structured motifs (Marsan and Sagot (2000a)). 

A related problem concerns the position $T_1$ of the first occurrence of a word;
it is treated in Rudander (1996) and more recently in Stefanov (2003). The
moment-generating function of $T_1$ given on page 271 is taken from Robin and
Daudin (1999). 

The Poisson approximation for the statistical distribution of the k-smallest 
r-scan presented on page 268 is due to Dembo and Karlin (1992). This approx- 
imation is very useful for the comparison between the expected distribution of 
the r-scans and the one observed in the biological sequence. It was first
applied in Karlin and Macken (1991) to the E. coli genome by approximating
the r-scan distribution given in Section 6.3.1 by a sum of r − 1 independent
exponential random variables. Robin and Daudin (2001) refined this approximation
using the exact distribution of the r-scans. Related work is presented
by Robin (2002) but using a compound Poisson model for the word occurrences
rather than a Markov model for the sequence of letters. This new approach
has the advantage of taking a possibly unexpected frequency of the word of
interest into account when analyzing its location along a sequence. See Glaz
et al. (2001) for more material and applications of scan statistics.


Word count distribution. The method of obtaining the exact distribution of the 
word count presented here generalizes the result that Gentleman and Mullin 
(1989) obtained for the case that the sequence is composed of i.i.d. letters, where 
each letter occurs with equal probability. In this case, Gentleman (1994) also 
gives an algorithm for calculating the word frequency distribution. Moreover, in 
the Markov case the exact distribution of the count can also be obtained by other 
techniques: Kleffe and Langbecker (1990) as well as Nicodème et al. (2002) used
an automaton built on the pattern structure matrix, whereas Régnier (2000) 
used a language decomposition approach to obtain the generating function of 
the count (see Chapter 7). 

The variance (6.4.1) of the count N(w) is inspired by Kleffe and Borodovsky 
(1992). 


Gaussian approximation. The asymptotic normality of the difference between 
the word count and its estimator was first proposed by Lundstrom (1990) using
the δ-method. For an exposition, see Waterman (1995). The two alternative
approaches presented in this chapter, the martingale and the conditional ones,
have the advantage of providing explicit formulas for the asymptotic variance.
They are both due to Prum et al. (1995) for the first order Markov chain model, 
and to Schbath (1995b) for higher order models and phased models. The con- 
ditional expectation of the count is originally due to Cowan (1991). 

The bound in Theorem 6.4.4 on the distance to the normal distribution was
obtained by Huang (2002). This paper, and references therein, also discuss
the constant c which appears in the bound. The result in the independent case
was first presented in Reinert et al. (2000).


Poisson and compound Poisson approximations. When the sequence letters 
are independent, Poisson and compound Poisson approximations for N(w) have 
been widely studied in the literature (Chryssaphinou and Papastavridis (1988a), 
Chryssaphinou and Papastavridis (1988b), Arratia et al. (1990), Godbole (1991), 
Hirano and Aki (1993), Godbole and Schaffner (1993), Fu (1993)). Markovian 
models under different conditions have then been considered (Rajarshi (1974), 
Godbole (1991), Godbole and Schaffner (1993), Hirano and Aki (1993), Geske 
et al. (1995), Schbath (1995a), Erhardsson (1997)), but few works concern gen- 
eral periodic words and provide explicit parameters of the limiting distribution. 
Our two basic references in this chapter are Arratia et al. (1990) and Schbath 
(1995a). 




For the compound Poisson and Poisson approximation error term due to 
the estimation of the transition probabilities, refer to Schbath (1995b). Reinert 
and Schbath (1998) showed that the end effects due to the finite sequence are 
negligible for the count (Equation (6.4.11)) and the count of clumps. 

The special case of runs of 1 in a random sequence of letters in the binary
alphabet {0,1} is extensively studied: Erdős and Rényi (1970) gave the
asymptotic behavior of the longest run in a sequence of Bernoulli trials, and
of the length of the longest segment that contains a proportion of 1 greater
than a prescribed level a. Their result was refined by Guibas and Odlyzko
(1980), Deheuvels et al. (1986), and Gordon et al. (1986). The compound Pois- 
son approximation for counts of runs in the case where the sequence letters are 
independent was considered by Eichelsbacher and Roos (1999), also employing 
the Chen-Stein method using results by Barbour and Utev (1998) (the limiting 
distribution is the same as the one given in (6.4.15), reduced to this special case). 
Barbour and Xia (1999) obtained a more accurate limiting approximation for 
the case of runs of length 2; this approximation is based on a perturbation of a 
Poisson distribution. 


Direct compound Poisson approximation. Theorem 6.4.9, which presents a direct 
compound Poisson approximation of the count, is due to Barbour et al. (2001). 
They give a more general form of the result, and also a bound for the Kolmogorov 
distance. Using the approach by Erhardsson (1999), they also derive a slightly 
less explicit but asymptotically better bound in terms of stopping times for a 
Markov chain. 

Indeed, in Erhardsson (1997), Erhardsson (1999) and Erhardsson (2000), a
different approach based on the direct compound Poisson approximation
Theorem 6.8.3 is developed. The idea is to express counts of events as numbers of
visits of a certain Markov chain to a rare set, and to use regeneration cycles for
suitable couplings. It results in bounds that are formulated in terms of stopping
times of Markov chains. Results of this type are less explicit, but they
have asymptotic order $O(n^{-1})$ under the typical regime $n\mu(w) = O(1)$, see also
Barbour et al. (2001) and Gusto (2000), whereas the bounds in Theorem 6.4.9
and in Corollary 6.4.8 (which is from Schbath (1995a)) are of order O(n log n)
under the same regime.

Numerical experiments in Barbour et al. (2001) show that the bound in
Theorem 6.4.9 and the bound from the Erhardsson (1997) approach perform better
than the bound in Corollary 6.4.8 for the word acgacg in the bacteriophage 
Lambda (n = 48,502) under three different transition matrices. In contrast, 
Gusto (2000) compared the result from Erhardsson (1999) to the one in Schbath 
(1995a) and did not observe any marked improvement for all words of length 8 
in the bacteriophage Lambda. This may illustrate that, whereas the compound 
Poisson approximation via a Poisson process approximation works well in the 
case of rare words, it does not yield the best bounds in the case of not so rare 
words. 


Approximation using a large deviation principle. Section 6.4.6 is inspired by 
Schbath (1995b). Nuel (2001) obtained a better approximation using a large
deviation principle for the empirical distribution of the ℓ-letter words. This
empirical distribution is defined as the random measure Ly,” on A® such that, 
for w € Af, 


i n—l+1 
Deel ras 22. 
4=1 


so that Ly ~(w) = N(w). However, the definition of the large deviation rate 
function and its mathematical treatment are more involved than the one given
in Section 6.4.6. 


Renewal count distribution. For a classical introduction to renewals, see
Chapter 13 in Feller (1968). Exact results for the distribution of $R_n$ can be found in
Régnier (2000). When the letters $X_1, \ldots, X_n$ are independent and identically
distributed, the asymptotic distribution of the renewal count was studied by
Breen et al. (1985) and Tanushev and Arratia (1997). The Central Limit
Theorem 6.5.1 in the Markovian case is due to Tanushev (1996). He also proved
a multivariate approximation. The theorem is much easier to prove in the
i.i.d. case, see Waterman (1995). The main technique being generating
functions, no bound on the rate of convergence is obtained.

The Poisson approximation for renewals based on the Poisson approximation 
for the number of clumps is the idea behind the proof of Geske et al. (1995), 
although Geske et al. (1995) prove the result only for words having at most 
one principal period. Related results have been obtained by Chryssaphinou and 
Papastavridis (1988b). Theorem 6.5.2 is due to Chryssaphinou et al. (2001); 
they also derive the stated conditions under which convergence to a Poisson 
distribution holds. 


Occurrences of multiple patterns. The multivariate generating function of the 
counts of multiple words can be found in Régnier (2000) and can be derived 
from Robin and Daudin (2001). The methods used are extensions of the ones 
presented in Subsection 6.4.1. 

The covariance was also calculated in Lundstrom (1990), in a different form. 
Theorem 6.6.1 is proven in Huang (2002); there it is also shown that L,, is 
invertible as well as a discussion of the constant c; see also references therein. 
As in Rinott and Rotar (1996), Huang (2002) considers more general classes of 
test functions as well, but not as general as to cover total variation. 

The Poisson and compound Poisson approximations for the joint distribution 
of declumped counts and multiple word counts presented here are due to Reinert 
and Schbath (1998). Recently, Chen and Xia (to appear) obtained a much improved
bound for the independent model, in the Wasserstein metric (which is weaker
than the total variation metric), for the Poisson approximation of counts of
palindromes, assuming the four-letter alphabet $\mathcal{A} = \{a, c, g, t\}$ and that $p_a = p_t$,
$p_c = p_g$. Formula (6.6.6) is due to Chryssaphinou et al. (2001).
Tanushev (1996) studied non-overlapping occurrences in competitions with
each other, including the derivation of the mean for the number of competing re- 
newal counts, and, most notably, the normal approximation Theorem 6.6.6. The 
mean of the total number of competing renewals, )>7_, R&(w”), has recently 
been presented in a slightly simpler form by Chryssaphinou et al. (2001). Also 
the alternative approach for a Poisson approximation for competing renewal 
counts is given in Chryssaphinou et al. (2001). 


Some applications to DNA sequences. The quality of the approximate p-values 
was extensively studied in Robin and Schbath (2001); their results here are com- 
bined with the approximate scores obtained with the large deviation approach 
of Nuel (2001). 

The details on the treatment of sequencing by hybridization as presented here 
are given in Arratia et al. (1996). The characterization of unique recoverability 
from the ℓ-spectrum is due to Pevzner (1989); Ukkonen (1992) conjectured
and Pevzner (1995) proved that there are exactly three structures that prevent 
unique recoverability. De Bruijn graphs are described in van Lint and Wilson 
(1992). Theorem 6.7.2 is from Arratia et al. (1996), where more detailed versions 
are also given. This bound is improved by Shamir and Tsur (2001). In Arratia 
et al. (1996), a more general result is derived for general alphabets, and explicit 
bounds are obtained. These bounds can be used to approximate the probability 
of unique recoverability. Arratia et al. (2000) have obtained results on the 
number of possible reconstructions for a given sequence (when the reconstruction 
is not unique). 


Some probabilistic and statistical tools. Stein’s method for the normal approx- 
imation was first published by Stein (1972). Rinott and Rotar (1996) applied 
it to obtain multivariate normal approximations with a bound on the error in 
the distance of suprema over convex sets, which yields Theorem 6.8.1. Indeed, 
Rinott and Rotar (1996) derive the result for more general classes of test func- 
tions. 

First published by Chen (1975) as the Poisson analog to Stein’s method for 
normal approximations (Stein (1972)), the Chen-Stein method for Poisson ap- 
proximation has found widespread application; word counts being just one of 
them. A friendly exposition is found in Arratia et al. (1989) and a description 
with many examples can be found in Arratia et al. (1990) and Barbour et al. 
(1992). The key theorem for word counts in stationary Markov chains is Theo- 
rem 1 in Arratia et al. (1990) with an improved bound by Barbour et al. (1992) 
(Theorem 1.A and Theorem 10.A), giving Theorem 6.8.2. 

Much of the subsection on direct compound Poisson approximation is based 
on the overview of Barbour and Chryssaphinou (2001). This approach started 
with Barbour et al. (1992); see also Roos (1993), Barbour and Utev (1998). 
Recently much attention has been given to this problem, and the reader is 
referred to the references in Barbour and Chryssaphinou (2001). 

For $\delta_3$, in Barbour and Chryssaphinou (2001) there is an additional, alternative
quantity given in terms of the Wasserstein distance between two distributions.
Instead of Condition (6.8.5), improved bounds on $H_0$ and $H_1$ are
also available under the condition that $m_1^{-1}(m_2 - m_1) < 1/2$, where $m_2 =
\sum_{k \ge 1} k^2 \nu_k$, see Barbour and Chryssaphinou (2001). Barbour and Chryssaphinou
(2001) also obtain Theorem 6.8.3, which in their paper is also phrased in the
Kolmogorov distance, and slightly more general, and Theorem 6.8.4. Barbour 
and Chryssaphinou (2001) also provide refined versions of this approach as well 
as results in Kolmogorov distance. Barbour and Mansson (2002) give related 
results in Wasserstein distance. 

A short outline of moment-generating functions can be found e.g., in Rice 
(1995). Theorem 6.8.5 on the delta method can be found for example on p.313 
in Waterman (1995). The large deviation principle Theorem 6.8.6 for Markov 
chains can be found on p.78 in Bucklew (1990). The martingale central limit 
theorem 6.8.7 is in Dacunha-Castelle and Duflo (1983) p.80. 


General tools. The autocorrelation polynomial was introduced by Guibas and 
Odlyzko (1980); see also Li (1980), Biggins and Cannings (1987). The result 
that two words commute if and only if they are powers of the same word can 
be found in Lothaire (1997). The Perron-Frobenius Theorem used on page
255 is classical; see for example Karlin and Taylor (1975). The Chi-square test 
for independence is textbook material in statistics; Rice (1995) gives a good 
exposition. The case of general order Markov chains is reviewed in Billingsley 
(1961). However, for higher order, a longer sequence of observations is required 
(see Guthrie and Youssef (1970)). For an introduction to martingales, see, e.g., 
Chung (1974). The Law of Iterated Logarithm for Markov chains is due to 
Senoussi (1990). 


Genome analysis. The first analysis of the restriction sites in E. coli was car- 
ried out by Churchill et al. (1990) while analyzing the distance between those 
sites. Avoidance of restriction sites in E. coli was first presented by Karlin et al.
(1992). The Cross-over Hotspot Instigator sites are very important for several 
bacteria (see Biaudet et al. (1998), Chedin et al. (1998), Sourice et al. (1998)). 
Their significant abundance was first shown in Schbath (1995b) for E.
coli and then in El Karoui et al. (1999) for other bacteria. Several papers aim 
at identifying over- and under-represented words in a particular genome (for 
instance, Leung et al. (1996), Rocha et al. (1998)). They usually use the maxi- 
mal Markov model (see also Brendel et al. (1986)). The Poisson approximation 
used in BLAST to approximate the p-value of a sequence alignment was first 
proposed in Altschul et al. (1990), and proven in Karlin and Dembo (1992). The 
variational composition of a genome has been studied with HMMs by Churchill
(1989), Muri (1998), Durbin et al. (1998). 
Analytic Approach to Pattern
7.0. Introduction


Repeated patterns and related phenomena in words are known to play a central 
role in many facets of computer science, telecommunications, coding, data com- 
pression, and molecular biology. One of the most fundamental questions arising 
in such studies is the frequency of pattern occurrences in another string known 
as the text. Applications of these results include gene finding in biology, code
synchronization, user search in wireless communications, detecting signatures 
of an attacker in intrusion detection, and discovering repeated strings in the 
Lempel-Ziv schemes and other data compression algorithms. 

The basic pattern matching problem is to find, for a given (or random) pattern w or
a set of patterns W and a text X, how many times W occurs in the text and how
long it takes for W to occur in X for the first time. These two problems are
not unrelated, as we have already seen in Chapter 6. Throughout this chapter
we allow patterns to overlap and we count overlapping occurrences separately.
For example, w = abab occurs three times in the text X = bababababb.
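
The counting convention can be checked with a short sketch. The function name `count_overlapping` is a hypothetical choice; the point of the example is that every starting position is counted, unlike Python's built-in `str.count`, which skips past each match and therefore counts only non-overlapping occurrences.

```python
def count_overlapping(pattern, text):
    """Number of positions of `text` at which `pattern` starts (overlaps allowed)."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one position to the right so overlapping matches are found.
        start = text.find(pattern, start + 1)
    return count


print(count_overlapping("abab", "bababababb"))
print("bababababb".count("abab"))
```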

We consider pattern matching problems in a probabilistic framework in 
which the text is generated by a probabilistic source while the pattern is given. 
In Chapter 1 various probabilistic sources were discussed. Here we succinctly 
summarize assumptions adopted in this chapter. In addition, we introduce a 
new general source known as a dynamic source recently proposed by Vallée. In 
Chapter 2 algorithmic aspects of pattern matching and various efficient algo- 
rithms for finding patterns were discussed. In this chapter, as in Chapter 6, 
we focus on analysis. However, unlike Chapter 6, we apply here analytic tools 
of combinatorics and analysis of algorithms to discover general laws of pattern 
occurrences. An immediate consequence of our results is the possibility to set 
thresholds at which a pattern in a text starts being (statistically) meaningful. 

The approach we undertake to analyze pattern matching problems is through 
a formal description by means of regular languages. Basically, such a descrip- 
tion of contexts of one, two, or several occurrences gives access to expecta- 
tion, variance, and higher moments, respectively. A systematic translation into 
generating functions of a complex variable z is available by methods of analytic
combinatorics deriving from the original Chomsky-Schützenberger theorem.
Then, the structure of the implied generating functions at a pole, usually
at z = 1, provides the necessary asymptotic information. In fact, there is 
an important phenomenon of asymptotic simplification where the essentials of 
combinatorial-probabilistic features are reflected by the singular forms of gener- 
ating functions. For instance, variance coefficients come out naturally from this 
approach together with a suitable notion of correlation. Perhaps the originality 
of the present approach lies in such a joint use of combinatorial-enumerative 
techniques and of analytic-probabilistic methods. 


There are various pattern matching problems. In its simplest form, the pattern 
W = w is a single string w and one searches for some/all occurrences of w as 
a block of consecutive symbols in the text. This problem is known as the exact 
string matching and its analysis is presented in Section 7.2 (cf. also Chapter 6). 
We adopt a symbolic approach, and first describe a language that contains all 
occurrences of w. Then we translate this language into a generating function 
that will lead to precise evaluation of the mean and the variance of the number 
of occurrences of the pattern. Finally, we prove the central and local limit laws, 
and large deviations. 

In the generalized string matching problem the pattern $\mathcal{W}$ is a set rather
than a single pattern. In its most general formulation, the pattern is a pair
$(\mathcal{W}_0, \mathcal{W})$ where $\mathcal{W}_0$ is the so-called forbidden set. If $\mathcal{W}_0 = \emptyset$, then $\mathcal{W}$ appears in
the text whenever a word from $\mathcal{W}$ occurs as a string, with overlapping allowed.
When $\mathcal{W}_0 \ne \emptyset$ one studies the number of occurrences of strings in $\mathcal{W}$ under
the condition that there is no occurrence of a string from $\mathcal{W}_0$ in the text X.
This could be called restricted string matching since one restricts the text
to those strings that do not contain strings from $\mathcal{W}_0$. Finally, setting $\mathcal{W} = \emptyset$
(with $\mathcal{W}_0 \ne \emptyset$) we search for the number of text strings that do not contain
any pattern from $\mathcal{W}_0$. In particular, for $d < k$, if $\mathcal{W}_0$ consists of runs of zeros of
length at least $d$ and at most $k$, then we deal with the so-called $(d, k)$ sequences
that find application in magnetic recording.
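
For very short texts the last counting problem, with an empty pattern set and a non-empty forbidden set, can be explored by brute force. The sketch below enumerates all texts of a given length and keeps those that contain no forbidden word; the function name `count_avoiding` is hypothetical, and exhaustive enumeration is an illustrative device only, not the generating-function method developed in this chapter.

```python
from itertools import product


def count_avoiding(alphabet, forbidden, n):
    """Number of length-n texts over `alphabet` containing no word of `forbidden`."""
    total = 0
    for letters in product(alphabet, repeat=n):
        text = "".join(letters)
        if not any(word in text for word in forbidden):
            total += 1
    return total


# Binary texts without two consecutive zeros.
print([count_avoiding("01", {"00"}, n) for n in range(1, 7)])
```

For the forbidden set {00} over the binary alphabet the counts follow the Fibonacci recurrence, which is the kind of regularity that a rational generating function captures.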

We shall present a complete analysis of the generalized string matching prob- 
lem in Section 7.3. We first consider the so called reduced set of patterns in which 
a string in W cannot be a substring of another string in W. We shall general- 
ize our combinatorial language approach from Section 7.2 to derive the mean, 
variance, central and local limit laws, and large deviations. Then we analyze
the generalized string pattern matching with $\mathcal{W}_0 = \emptyset$ and adopt a different
approach. We shall construct an automaton to recognize the pattern $\mathcal{W}$ that
turns out to be a de Bruijn graph. The generating function of the number of
occurrences will have a matrix form with the main matrix representing the transition
matrix of the associated de Bruijn graph. Finally, we consider the $(d, k)$
sequences and enumerate them, leading to the Shannon capacity.

In Section 7.4 we discuss a new pattern matching problem called the sub- 
sequence pattern matching or the hidden pattern matching. In this case the 
pattern $\mathcal{W} = a_1 a_2 \cdots a_m$, where $a_i$ is a symbol of the underlying alphabet, is to
occur as a subsequence rather than a string (consecutive symbols) in a text. We 
say that W is hidden in the text. For example, date occurs as a subsequence in 
the text hidden pattern, in fact four times, but not even once as a string. The 
gaps between occurrences of W may be bounded or unrestricted. The extreme 
cases are: fully unconstrained problem where all gaps are unbounded; and the 
fully constrained problem where all gaps are bounded. We analyze these and 
mixed cases. 
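
The count quoted above for date in hidden pattern can be checked with a short dynamic program. This is only a sketch of the fully unconstrained case (all gaps unbounded); the function name `count_hidden` is ours and not part of the text.

```python
def count_hidden(pattern: str, text: str) -> int:
    """Count occurrences of `pattern` in `text` as a subsequence
    (fully unconstrained case: gaps of any length are allowed)."""
    # ways[j] = number of ways to match the first j pattern symbols so far
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        # scan right to left so one text symbol extends each partial match once
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[-1]


print(count_hidden("date", "hidden pattern"))  # 4 hidden occurrences
print("date" in "hidden pattern")              # False: never as a string
```

Bounded gaps (the constrained cases) would need the positions of the matched symbols, which this one-pass count deliberately ignores.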

In Section 7.5 we generalize all of the above pattern matching problems
and analyze the generalized subsequence problem. In this case, the pattern is
$\mathcal{W} = (\mathcal{W}_1,\ldots,\mathcal{W}_d)$ where $\mathcal{W}_i$ is a collection of strings (a language). We say
that the generalized pattern $\mathcal{W}$ occurs in the text X if X contains $\mathcal{W}$ as a subsequence
$(w_1, w_2, \ldots, w_d)$ where $w_i \in \mathcal{W}_i$. Clearly, it includes all the problems
discussed so far. We shall analyze this generalized pattern matching for gen- 
eral probabilistic dynamic sources (which include among others Markov sources 
and mixing sources). The novelty of the analysis lies in translating probabili- 
ties into composition of operators. Under a mild decomposability assumption, 
these operators admit spectral representations that allow us to derive precise
asymptotic behavior for quantities of interest. 

Finally, in the last section we study a different pattern matching, namely the 
one in which the pattern is part of the (random) text. We coin the term self- 
repetitive pattern matching. More precisely, we look for the longest substring of 
the text occurring at a given position that has another copy in the text. This
new quantity, when averaged over all possible positions of the text, is actually 
the typical depth in a suffix trie (cf. Chapter 2) built over (randomly generated) 
text. We analyze it using analytic techniques such as generating functions and 
the Mellin transform. We reduce its analysis to the exact pattern matching; 
thus we call the technique the string-ruler method. In fact, we prove that the 
probability generating function of the depth in a suffix trie is asymptotically 
close to the probability generating function of the depth in a trie that is built 
over n independently generated texts. Such tries have been extensively studied in 
the past and we have pretty good understanding of their probabilistic behaviors. 
This allows us to conclude that the depth in a suffix trie is asymptotically 
normal. 


7.1. Probabilistic models 


We study here pattern matching in a probabilistic framework in which the text 
is generated randomly. Let us first introduce some general probabilistic models 
of generating sequences. The reader is also referred to Chapter 1 for a brief 
introduction to probabilistic models. For the convenience of the reader, we 
repeat here some definitions. 

Throughout we shall deal with sequences of discrete random variables. We 
write $(X_k)_{k=1}^{\infty}$ for a one-sided infinite sequence of random variables; however, we
often abbreviate it as X provided it is clear from the context that we are talking
about a sequence, not a single variable. We assume that the sequence $(X_k)_{k=1}^{\infty}$
is defined over a finite alphabet $\mathcal{A} = \{a_1,\ldots,a_V\}$ of size V. A partial sequence
is denoted as $X_m^n = (X_m,\ldots,X_n)$ for $m < n$. Finally, we shall always assume
that a probability measure exists, and we write
$P(x_1^n) = P(X_k = x_k,\ 1 \le k \le n,\ x_k \in \mathcal{A})$ for the probability mass, where we use lowercase letters for a
realization of a stochastic process. 

Sequences are generated by information sources, usually satisfying some con- 
straints. We also call them probabilistic models. Throughout, we assume the 
existence of a stationary probability distribution, that is, for any string w the 
probability that the text X contains an occurrence of w at position k is equal 
to P(w) independently of the position k. For P(w) > 0, we denote by $P(u \mid w)$
the conditional probability, which equals P(wu)/P(w).

The most elementary source is a memoryless source also known as the 
Bernoulli source. 


(B) MEMORYLESS OR BERNOULLI SOURCE 


Symbols of the alphabet $\mathcal{A} = \{a_1,\ldots,a_V\}$ occur independently of one another;
thus $X = X_1 X_2 X_3 \ldots$ can be described as the outcome of an infinite
sequence of Bernoulli trials in which $P(X_j = a_i) = p_i$ and $\sum_{i=1}^{V} p_i = 1$.
Throughout, we assume that at least for one i we have $0 < p_i < 1$.


In many cases, assumption (B) is not very realistic. When this is the case, 
assumption (B) may be replaced by: 

(M) MARKOV SOURCE OF ORDER ONE


There is a Markovian dependency between consecutive symbols in a string;
that is, the probability $p_{i,j} = P(X_{k+1} = a_j \mid X_k = a_i)$ describes the conditional
probability of sampling symbol $a_j$ immediately after symbol $a_i$. We
denote by $P = \{p_{i,j}\}_{i,j=1}^{V}$ the transition matrix, and by $\pi = (\pi_1,\ldots,\pi_V)$
the stationary vector satisfying $\pi P = \pi$. (Throughout, we assume that the
Markov chain is irreducible and aperiodic.) A general Markov source of
order r is characterized by the $V^r \times V$ transition matrix with coefficients
$P(j \in \mathcal{A} \mid u)$ for $u \in \mathcal{A}^r$.
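
To make models (B) and (M) concrete, the following sketch computes the stationary probability of a word under each of them with exact rational arithmetic. The two-symbol transition matrix and all function names are illustrative assumptions, not data from the text.

```python
from fractions import Fraction


def prob_memoryless(word, p):
    """P(word) for a memoryless source with symbol probabilities p[symbol]."""
    result = Fraction(1)
    for symbol in word:
        result *= p[symbol]
    return result


def prob_markov(word, P, pi):
    """Stationary P(word) for an order-one Markov source.
    P[a][b] is the probability of b right after a; pi is the stationary vector."""
    if not word:
        return Fraction(1)
    result = pi[word[0]]
    for prev, cur in zip(word, word[1:]):
        result *= P[prev][cur]
    return result


# Hypothetical two-symbol chain (the numbers are assumptions for illustration).
P = {"a": {"a": Fraction(1, 4), "b": Fraction(3, 4)},
     "b": {"a": Fraction(1, 2), "b": Fraction(1, 2)}}
# For a two-state chain the stationary vector is proportional to (p_ba, p_ab).
total = P["b"]["a"] + P["a"]["b"]
pi = {"a": P["b"]["a"] / total, "b": P["a"]["b"] / total}

uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
print(prob_memoryless("aba", uniform))  # 1/8
print(prob_markov("aba", P, pi))        # pi_a * p_ab * p_ba = 3/20
```

For larger alphabets the stationary vector would have to be obtained by solving $\pi P = \pi$; the closed form used here is specific to two states.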


In some situations more general sources must be considered (for which one 
still can obtain reasonably precise analysis). Recently, Vallée introduced new 
sources called dynamic sources that we briefly describe here and use in the 
analysis of the generalized subsequence problem in Section 7.5. To introduce 
such sources we start with a description of a dynamic system defined by: 


• A topological partition of the unit interval $\mathcal{I} := (0,1)$ into a disjoint set
of open intervals $\mathcal{I}_a$, $a \in \mathcal{A}$.


• An encoding mapping $\chi$ which is constant and equal to $a \in \mathcal{A}$ on each $\mathcal{I}_a$.


• A shift mapping $T: \mathcal{I} \to \mathcal{I}$ whose restriction to $\mathcal{I}_a$ is a bijection of class
$C^2$ from $\mathcal{I}_a$ to $\mathcal{I}$. The local inverse of T restricted to $\mathcal{I}_a$ is denoted by $h_a$.


Observe that such a dynamic system produces infinite words of $\mathcal{A}^{\infty}$ through
the encoding $\chi$. For an initial $x \in \mathcal{I}$ the source outputs a word, say
$w(x) = (\chi x, \chi T x, \chi T^2 x, \ldots)$.


(DS) PROBABILISTIC DYNAMIC SOURCE 


A source is called a probabilistic dynamic source, if the unit interval of a 
dynamic system is endowed with a density f. 


EXAMPLE 7.1.1. A memoryless source associated with the probability distribution
$\{p_i\}_{i=1}^{V}$ (where V can be finite or infinite) is modeled by a dynamic source
in which the components $w_k(x) = \chi T^k x$ are independent and the corresponding
topological partition of $\mathcal{I}$ is defined as

$$\mathcal{I}_m := (q_m, q_{m+1}], \qquad q_m = \sum_{j<m} p_j.$$

In particular, a symmetric V-ary memoryless source can be described as

$$T(x) = \{Vx\}, \qquad \chi(x) = \lfloor Vx \rfloor,$$

where $\lfloor x \rfloor$ is the integer part of x and $\{x\} = x - \lfloor x \rfloor$ is the fractional part of x
(cf. Figure 7.1(a)).

Figure 7.1. Dynamic sources discussed in Example 7.1.1: (a) memoryless source
with the shift mapping $T_m(x) = (x - q_m)/p_{m+1}$; (b) continued
fraction source with $T_m(x) = 1/x - m = \{1/x\}$.


Here is another example of a source with memory related to continued fractions.
The alphabet $\mathcal{A}$ is the set of all natural numbers and the partition of $\mathcal{I}$ is
defined as $\mathcal{I}_m = (1/(m+1), 1/m)$. The restriction of T to $\mathcal{I}_m$ is the decreasing
linear fractional transformation $T(x) = 1/x - m$, that is,

$$T(x) = \{1/x\}, \qquad \chi(x) = \lfloor 1/x \rfloor.$$

Observe that the inverse branches $h_m$ are defined as $h_m(x) = 1/(x + m)$ (cf.
Figure 7.1(b)).
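
The two dynamic sources of Example 7.1.1 can be simulated directly from their shift and encoding mappings. The sketch below uses exact rationals as starting points, so the orbit reaches 0 and the emitted word is finite; the helper names `emit` and `frac` are assumptions made for this illustration.

```python
from fractions import Fraction
from math import floor


def frac(x):
    """Fractional part {x} = x - floor(x)."""
    return x - floor(x)


def emit(x, shift, encode, length):
    """Emit up to `length` symbols chi(x), chi(Tx), chi(T^2 x), ...
    Stops early if the orbit reaches 0 (possible for rational start points)."""
    word = []
    for _ in range(length):
        if x == 0:
            break
        word.append(encode(x))
        x = shift(x)
    return word


V = 2  # symmetric binary memoryless source: T(x) = {Vx}, chi(x) = floor(Vx)
binary = emit(Fraction(5, 8), lambda x: frac(V * x), lambda x: floor(V * x), 10)

# continued fraction source: T(x) = {1/x}, chi(x) = floor(1/x)
cf = emit(Fraction(7, 16), lambda x: frac(1 / x), lambda x: floor(1 / x), 10)

print(binary)  # [1, 0, 1]  since 5/8 = 0.101 in base 2
print(cf)      # [2, 3, 2]  since 7/16 = 1/(2 + 1/(3 + 1/2))
```

The first source emits the base-V digits of x and the second its continued fraction digits, which is why the latter has memory.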


Let us observe that a word of length k, say $w = w_1 w_2 \cdots w_k$, is associated
with the mapping $h_w := h_{w_1} \circ h_{w_2} \circ \cdots \circ h_{w_k}$, which is an inverse branch of $T^k$. In
fact, all words that begin with the same prefix w belong to the same fundamental
interval defined as $\mathcal{I}_w = (h_w(0), h_w(1))$. Furthermore, for probabilistic dynamic
sources with the density f, one easily computes the probability of w as the
measure of the interval $\mathcal{I}_w$.


The probability P(w) of a word w can be explicitly computed through the
special generating operator $G_w$ defined as follows

$$G_w[f](t) := |h_w'(t)|\, f \circ h_w(t). \tag{7.1.1}$$

One recognizes in $G_w[f](t)$ a density mapping, that is, $G_w[f](t)$ is the density
of f mapped over $h_w(t)$. The probability of w can then be computed as

$$P(w) = \left| \int_{h_w(0)}^{h_w(1)} f(t)\,dt \right| = \int_0^1 |h_w'(t)|\, f \circ h_w(t)\,dt = \int_0^1 G_w[f](t)\,dt. \tag{7.1.2}$$

Let us now consider a concatenation of two words w and u. For memoryless
sources $P(w \cdot u) = P(w)P(u)$. For Markov sources one still obtains the
product of conditional probabilities. Dynamic sources replace the product of
probabilities by the product (composition) of generating operators. To see this,
we observe that

$$G_{wu} = G_u \circ G_w, \tag{7.1.3}$$

where we write $G_w := G_w[f](t)$. Indeed, $h_{wu} = h_w \circ h_u$ and
$G_{wu} = |h_w' \circ h_u| \cdot |h_u'| \cdot f \circ h_w \circ h_u$, while $G_w = |h_w'| \cdot f \circ h_w$ and then
$G_u \circ G_w = |h_u'| \cdot |h_w' \circ h_u| \cdot f \circ h_w \circ h_u$,
as desired.
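
As a small illustration of fundamental intervals, the sketch below computes P(w) for the continued fraction source as the length of $\mathcal{I}_w$, under the assumption that the density f is uniform on (0,1). It also shows that the product rule of memoryless sources fails here. Function names are ours.

```python
from fractions import Fraction


def h(m, x):
    """Inverse branch h_m(x) = 1/(x + m) of the continued fraction map."""
    return 1 / (x + Fraction(m))


def h_word(word, x):
    """h_w = h_{w_1} o h_{w_2} o ... o h_{w_k}: the last symbol is applied first."""
    for m in reversed(word):
        x = h(m, x)
    return x


def prob_uniform(word):
    """P(w) = |h_w(1) - h_w(0)|, assuming the uniform density f = 1 on (0,1)."""
    return abs(h_word(word, Fraction(1)) - h_word(word, Fraction(0)))


print(prob_uniform((1,)))     # 1/2
print(prob_uniform((1, 1)))   # 1/6, not P(1) * P(1) = 1/4
# P(1, m) = 1/((m+1)(m+2)); these sum to P(1) = 1/2 over all m >= 1
print(sum(prob_uniform((1, m)) for m in range(1, 50)))
```

The absolute value matters because each $h_m$ is decreasing, so $h_w(0)$ and $h_w(1)$ swap order whenever the word has odd length.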


7.2. Exact string matching 


In the exact string matching problem the pattern $w = w_1 w_2 \cdots w_m$ of length
m is given while the text $X = X_1^n = X_1 \ldots X_n$ of length n is generated by a
random source. Observe that since the pattern w is given, its length m will not
vary with n when $n \to \infty$ (asymptotic analysis).

There are several parameters of interest in string matching, but two of
them stand out. The first is the number of times w occurs in X, which we denote
as $N_n := N_n(w)$ and define formally by

$$N_n(w) = \mathrm{Card}(\{i : X_{i-m+1}^{i} = w,\ m \le i \le n\}).$$

We can write $N_n(w)$ in an equivalent form as follows

$$N_n(w) = I_m + I_{m+1} + \cdots + I_n, \tag{7.2.1}$$

where $I_i = 1$ if w occurs at position i (that is, $X_{i-m+1}^{i} = w$) and $I_i = 0$ otherwise.
The second parameter is the waiting time $T_w$ defined as the first time w
occurs in the text X, that is,

$$T_w := \min\{n : X_{n-m+1}^{n} = w\}.$$

One can also define $T_j$ as the minimum length of the text in which the pattern
w occurs j times. Clearly, $T_w = T_1$. These parameters are not independent
since

$$\{T_w > n\} = \{N_n(w) = 0\}. \tag{7.2.2}$$

More generally,

$$\{T_j \le n\} = \{N_n(w) \ge j\}. \tag{7.2.3}$$

Relation (7.2.3) is called the duality principle in Chapter 6.
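
The two parameters and the duality (7.2.3) are easy to check on a fixed text. The following is a minimal sketch, with overlapping occurrences counted; the function names are assumptions made for this example.

```python
def occurrence_ends(w, text):
    """1-based positions i such that text[i-m+1..i] equals w (overlaps allowed)."""
    m = len(w)
    return [i for i in range(m, len(text) + 1) if text[i - m:i] == w]


def N(w, text, n):
    """N_n(w): number of occurrences of w among the first n symbols of text."""
    return sum(1 for i in occurrence_ends(w, text) if i <= n)


def T(w, text, j=1):
    """T_j: length of the shortest prefix containing j occurrences (None if fewer)."""
    ends = occurrence_ends(w, text)
    return ends[j - 1] if len(ends) >= j else None


text, w = "abababb", "abab"
print(occurrence_ends(w, text))   # [4, 6]: two overlapping occurrences
print(T(w, text), T(w, text, 2))  # 4 6

# duality principle: T_j <= n exactly when N_n(w) >= j
for n in range(len(text) + 1):
    for j in range(1, 4):
        t_j = T(w, text, j)
        assert (t_j is not None and t_j <= n) == (N(w, text, n) >= j)
```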

Our goal is to estimate the frequency of pattern occurrences $N_n$ in a text
generated by a Markov source. We allow patterns to overlap when counting
occurrences (e.g., if w = abab, then it occurs twice in X = abababb when
overlapping is allowed; it occurs only once if overlapping is not allowed). We
study the probabilistic behavior of $N_n$ through two generating functions, namely:

$$N_r(z) = \sum_{n \ge 0} P(N_n(w) = r) z^n,$$

$$N(z,u) = \sum_{r=1}^{\infty} N_r(z) u^r = \sum_{r=1}^{\infty} \sum_{n=0}^{\infty} P(N_n(w) = r) z^n u^r,$$

that are defined for $|z| < 1$ and $|u| < 1$.

Throughout this section we adopt a combinatorial approach to string match- 
ing, that is, we use combinatorial calculus to find combinatorial relationships 
between sets of words satisfying certain properties (i-e., languages). Alterna- 
tively, we could start with the representation (7.2.1) and use probabilistic tools 
along the lines already discussed in Chapter 6. 


7.2.1. Languages representations 


We start our combinatorial analysis with some definitions. For any language $\mathcal{L}$
we define its generating function L(z) as

$$L(z) = \sum_{u \in \mathcal{L}} P(u) z^{|u|},$$

where P(u) is the stationary probability of occurrence of u, |u| is the length of u,
and we assume that $P(\varepsilon) = 1$. Notice that L(z) is defined for all complex z
such that |z| < 1. In addition, we define the w-conditional generating function
of $\mathcal{L}$ as

$$L_w(z) = \sum_{u \in \mathcal{L}} P(u \mid w) z^{|u|} = \sum_{u \in \mathcal{L}} \frac{P(wu)}{P(w)} z^{|u|}.$$

Since we allow overlaps, the structure of the pattern has a profound impact 
on the number of occurrences. To capture this, we introduce the autocorrelation 
language and the autocorrelation polynomial. Given a string w, we define the 
autocorrelation set S as: 


$$\mathcal{S} = \{w_{k+1}^m : w_1^k = w_{m-k+1}^m\}. \tag{7.2.4}$$

By $\mathcal{P}(w)$ we denote the set of positions $k \ge 1$ satisfying $w_1^k = w_{m-k+1}^m$. In other
words, if w = vu and w = ux for some words v, x and u, then x belongs to $\mathcal{S}$
and $|u| \in \mathcal{P}(w)$. Notice that $\varepsilon \in \mathcal{S}$. The generating function of the language $\mathcal{S}$ is
denoted as S(z) and we call it the autocorrelation polynomial. Its w-conditional
generating function is denoted $S_w(z)$. In particular, for Markov sources (of
order one)

$$S_w(z) = \sum_{k \in \mathcal{P}(w)} P(w_{k+1}^m \mid w_k^k)\, z^{m-k}. \tag{7.2.5}$$


Before we proceed, let us present a simple example illustrating the definitions 
introduced so far. 

EXAMPLE 7.2.1. Let us assume that w = aba over a binary alphabet $\mathcal{A} = \{a,b\}$.
Observe that $\mathcal{P}(w) = \{1,3\}$ and $\mathcal{S} = \{\varepsilon, ba\}$, where $\varepsilon$ is the empty word.
Thus, for the unbiased memoryless source we have $S(z) = 1 + z^2/4$, while for the
Markovian model of order one, we obtain $S_{aba}(z) = 1 + p_{ab}p_{ba}z^2$.
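
The autocorrelation set and polynomial can be computed mechanically from the definition. The sketch below handles the memoryless case only, with exact rational coefficients; the function names are assumptions for illustration.

```python
from fractions import Fraction


def autocorrelation(w):
    """Return (positions, S): the positions k with w[:k] == w[m-k:],
    and the corresponding autocorrelation set S = {w[k:]}."""
    m = len(w)
    positions = [k for k in range(1, m + 1) if w[:k] == w[m - k:]]
    return positions, [w[k:] for k in positions]


def autocorrelation_polynomial(w, p):
    """Coefficients of S(z) for a memoryless source; coeff[d] multiplies z^d."""
    coeff = [Fraction(0)] * len(w)
    for suffix in autocorrelation(w)[1]:
        weight = Fraction(1)
        for symbol in suffix:
            weight *= p[symbol]
        coeff[len(suffix)] += weight
    return coeff


unbiased = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
print(autocorrelation("aba"))                       # ([1, 3], ['ba', ''])
print(autocorrelation_polynomial("aba", unbiased))  # coefficients of 1 + z^2/4
```

A pattern with no nontrivial overlap, such as aab, gives $\mathcal{S} = \{\varepsilon\}$ and S(z) = 1.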


Our goal is to estimate the number of pattern occurrences in a text. Alter- 
natively, we can seek the generating function of a language that consists of all 
words containing some occurrences of w. Given a pattern w, we introduce the 
following languages: 


(i) $\mathcal{T}_r$ as a set of words containing exactly r occurrences of w.


(ii) $\mathcal{R}$ as a set of words containing only one occurrence of w, located at the
right end.


(iii) $\mathcal{U}$ defined as
$$\mathcal{U} = \{u : w \cdot u \in \mathcal{T}_1\}, \tag{7.2.6}$$
that is, a word $u \in \mathcal{U}$ if $w \cdot u$ has exactly one occurrence of w at the left
end of $w \cdot u$.


(iv) $\mathcal{M}$ defined as
$$\mathcal{M} = \{v : w \cdot v \in \mathcal{T}_2 \text{ and } w \text{ occurs at the right end of } w \cdot v\},$$
that is, $\mathcal{M}$ is a language such that any word in $w \cdot \mathcal{M}$ has exactly two
occurrences of w at the left and right end.


EXAMPLE 7.2.2. Let $\mathcal{A} = \{a,b\}$ and w = abab. Then $r = aaabab \in \mathcal{R}$ since
there is only one occurrence of w at the right end of r. Also, $u = bbbb \in \mathcal{U}$
since wu has only one occurrence of w at the left end; but $v = abbbb \notin \mathcal{U}$
since wv = abababbbb has two occurrences of w. Furthermore, $ab \in \mathcal{M}$ since
$wm = ababab \in \mathcal{T}_2$ has two occurrences of w at the left and the right ends.
Finally, $t = bbabababbbababbb \in \mathcal{T}_3$ and one observes that $t = r m_1 m_2 u$ where
$r = bbabab \in \mathcal{R}$, $m_1 = ab \in \mathcal{M}$, $m_2 = bbabab \in \mathcal{M}$, and $u = bb \in \mathcal{U}$.
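
The membership claims of Example 7.2.2 can be verified by direct counting. The predicates below are a sketch that follows the definitions (ii) to (iv) literally; their names are assumptions made for this illustration.

```python
def count(w, x):
    """Number of (possibly overlapping) occurrences of w in x."""
    return sum(x.startswith(w, i) for i in range(len(x) - len(w) + 1))


def in_R(w, r):
    """r has exactly one occurrence of w, located at its right end."""
    return count(w, r) == 1 and r.endswith(w)


def in_U(w, u):
    """w.u has exactly one occurrence of w (the leading one)."""
    return count(w, w + u) == 1


def in_M(w, v):
    """w.v has exactly two occurrences of w, at its left and right ends."""
    return count(w, w + v) == 2 and (w + v).endswith(w)


w = "abab"
print(in_R(w, "aaabab"), in_U(w, "bbbb"), in_U(w, "abbbb"), in_M(w, "ab"))
# True True False True

# the factorization t = r m1 m2 u of a word with three occurrences
r, m1, m2, u = "bbabab", "ab", "bbabab", "bb"
t = r + m1 + m2 + u
print(t, count(w, t))  # bbabababbbababbb 3
```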


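We now describe languages $\mathcal{T}_{\ge 1} = \bigcup_{r \ge 1} \mathcal{T}_r$ (set of words containing at least
one occurrence of w) and $\mathcal{T}_r$ in terms of $\mathcal{R}$, $\mathcal{M}$, and $\mathcal{U}$. Recall that $\mathcal{M}^r$ denotes
the concatenation of r languages $\mathcal{M}$, and $\mathcal{M}^0 = \{\varepsilon\}$. Also, $\mathcal{M}^+ = \bigcup_{r \ge 1} \mathcal{M}^r$ and
$\mathcal{M}^* = \bigcup_{r \ge 0} \mathcal{M}^r$.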


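THEOREM 7.2.3. The languages $\mathcal{T}_r$ for $r \ge 1$ and $\mathcal{T}_{\ge 1}$ satisfy the relations

$$\mathcal{T}_r = \mathcal{R} \cdot \mathcal{M}^{r-1} \cdot \mathcal{U}, \tag{7.2.7}$$
and therefore
$$\mathcal{T}_{\ge 1} = \mathcal{R} \cdot \mathcal{M}^{*} \cdot \mathcal{U}. \tag{7.2.8}$$
In addition, we have:
$$\mathcal{T}_0 \cdot w = \mathcal{R} \cdot \mathcal{S}. \tag{7.2.9}$$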

Proof. To prove (7.2.7), we obtain our decomposition of $\mathcal{T}_r$ as follows: The first
occurrence of w in a word belonging to $\mathcal{T}_r$ determines a prefix p that is in
$\mathcal{R}$. After concatenating a nonempty word v we create the second occurrence of
w provided $v \in \mathcal{M}$. This process is repeated r - 1 times. Finally, after the last
w occurrence we add a suffix u that does not create a new occurrence of w, that
is, wu is such that $u \in \mathcal{U}$. Clearly, a word belongs to $\mathcal{T}_{\ge 1}$ if for some $1 \le r < \infty$
it is in $\mathcal{T}_r$.
The derivation of (7.2.9) is left to the reader as Exercise 7.2.1.


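EXAMPLE 7.2.4. Let w = TAT. The following string belongs to $\mathcal{T}_3$:

$$\underbrace{CCTAT}_{\mathcal{R}}\ \underbrace{AT}_{\mathcal{M}}\ \underbrace{GATAT}_{\mathcal{M}}\ \underbrace{GGA}_{\mathcal{U}}.$$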

We now prove the following result that summarizes relationships between 
the languages R, M, and U. 


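THEOREM 7.2.5. The languages $\mathcal{M}$, $\mathcal{R}$, and $\mathcal{U}$ satisfy

$$\mathcal{M}^{*} = \mathcal{A}^{*} \cdot w + \mathcal{S}, \tag{7.2.10}$$
$$\mathcal{U} \cdot \mathcal{A} = \mathcal{M} + \mathcal{U} - \{\varepsilon\}, \tag{7.2.11}$$
$$w \cdot (\mathcal{M} - \varepsilon) = \mathcal{A} \cdot \mathcal{R} - \mathcal{R}. \tag{7.2.12}$$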


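Proof. We first deal with (7.2.10). Clearly, $\mathcal{A}^{*} w$ contains at least one occurrence
of w on the right, hence $\mathcal{A}^{*} w \subset \mathcal{M}^{*}$. Furthermore, a word v in $\mathcal{M}^{*}$ is not in
$\mathcal{A}^{*} \cdot w$ if and only if its size |v| is smaller than |w| (e.g., think of $v = ab \in \mathcal{M}$ for
w = abab). Then the second w occurrence in wv overlaps with w, which means
that v is in $\mathcal{S}$.

Let us turn now to (7.2.11). When one adds a character $a \in \mathcal{A}$ right after a
word u from $\mathcal{U}$, two cases may occur. Either wua still does not contain a second
occurrence of w (which means that ua is a nonempty word of $\mathcal{U}$) or a new w
appears, clearly at the right end. Hence $\mathcal{U} \cdot \mathcal{A} \subset \mathcal{M} + \mathcal{U} - \varepsilon$. Now let $v \in \mathcal{M} - \varepsilon$;
then by definition $wv \in \mathcal{T}_2$, so that $v \in \mathcal{U} \cdot \mathcal{A} - \mathcal{U}$, which proves (7.2.11).

We now prove (7.2.12). Let x = ar be a word in $w \cdot (\mathcal{M} - \varepsilon)$ where
$a \in \mathcal{A}$. As x contains exactly two occurrences of w located at its left and right
ends, r is in $\mathcal{R}$ and x is in $\mathcal{A} \cdot \mathcal{R} - \mathcal{R}$, hence $w(\mathcal{M} - \varepsilon) \subset \mathcal{A} \cdot \mathcal{R} - \mathcal{R}$. To prove
$\mathcal{A} \cdot \mathcal{R} - \mathcal{R} \subset w(\mathcal{M} - \varepsilon)$, we take a word arw from $\mathcal{A} \cdot \mathcal{R}$ that is not in $\mathcal{R}$. Then
arw contains a second w occurrence starting in ar. As rw is in $\mathcal{R}$, the only
possible position is at the left end, and then arw is in $w(\mathcal{M} - \varepsilon)$. This proves
(7.2.12).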


7.2.2. Generating functions


The next step is to translate the relationships between languages into the as- 
sociated generating functions. Therefore, we must now select the probabilistic 
model according to which the text is generated. We derive our results for a 
Markov model of order one. We adopt the following notation: To extract a
particular element, say with index (i,j), from a matrix, say P, we shall write
$[P]_{i,j} = p_{i,j}$. We also recall that $(I - P)^{-1} = \sum_{k \ge 0} P^k$ provided $\|P\| < 1$ for a
matrix norm $\|\cdot\|$. We also write $\Pi$ for the stationary matrix that consists of
V identical rows equal to the stationary vector $\pi$. Finally, by Z we denote the fundamental matrix
$Z = (I - (P - \Pi))^{-1}$ where I is the identity matrix.

The next lemma translates the relationships between languages (7.2.10) to
(7.2.12) into generating functions $M_w(z)$, $U_w(z)$ and $R(z)$ of languages $\mathcal{M}$, $\mathcal{U}$
and $\mathcal{R}$ (we recall that the first two generating functions are conditioned on w
appearing just before any word from $\mathcal{M}$ and $\mathcal{U}$). We define a function F(z) by

$$F(z) = \frac{1}{\pi_{w_1}} \sum_{n \ge 0} \left[(P - \Pi)^{n+1}\right]_{w_m,w_1} z^n = \frac{1}{\pi_{w_1}} \left[(P - \Pi)(I - (P - \Pi)z)^{-1}\right]_{w_m,w_1} \tag{7.2.13}$$

for $|z| < \|P - \Pi\|^{-1}$, where $\pi_{w_1}$ is the stationary probability of the first symbol
$w_1$ of w. For memoryless sources F(z) = 0.


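LEMMA 7.2.6. For Markov sources (of order one), the generating functions
associated with languages $\mathcal{M}$, $\mathcal{U}$, and $\mathcal{R}$ satisfy

$$\frac{1}{1 - M_w(z)} = S_w(z) + P(w) z^m \left(\frac{1}{1-z} + F(z)\right), \tag{7.2.14}$$
$$U_w(z) = \frac{M_w(z) - 1}{z - 1}, \tag{7.2.15}$$
$$R(z) = P(w) z^m \cdot U_w(z), \tag{7.2.16}$$

provided the underlying Markov chain is aperiodic and ergodic.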


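Proof. We first prove (7.2.15). Let us consider language relationship (7.2.11)
from Theorem 7.2.5, which we rewrite as $\mathcal{U} \cdot \mathcal{A} - \mathcal{U} = \mathcal{M} - \varepsilon$. Observe that
$\sum_{b \in \mathcal{A}} p_{ab} z = z$. Hence, the set $\mathcal{U} \cdot \mathcal{A}$ yields (conditioning on the left occurrence of w)

$$\sum_{u \in \mathcal{U}} \sum_{b \in \mathcal{A}} P(ub \mid w) z^{|ub|} = \sum_{a \in \mathcal{A}} \sum_{u \in \mathcal{U},\ \ell(u) = a} P(u \mid w) z^{|u|} \sum_{b \in \mathcal{A}} p_{ab} z = U_w(z) \cdot z,$$

where $\ell(u)$ denotes the last symbol of the word wu. Of course, $\mathcal{M} - \varepsilon$ and $\mathcal{U}$
translate into $M_w(z) - 1$ and $U_w(z)$, and (7.2.15) is proved.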

We now turn our attention to (7.2.16), and we use relationship (7.2.12),
$w \cdot \mathcal{M} - w = \mathcal{A} \cdot \mathcal{R} - \mathcal{R}$, of Theorem 7.2.5. In order to compute the
generating function of $\mathcal{A} \cdot \mathcal{R}$ we proceed as follows

$$\sum_{ab \in \mathcal{A}^2} \sum_{bv \in \mathcal{R}} P(abv) z^{|abv|} = z^2 \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{A}} \sum_{bv \in \mathcal{R}} \pi_a p_{ab} P(v \mid v_{-1} = b) z^{|v|}.$$

But due to the stationarity of the underlying Markov chain $\sum_a \pi_a p_{ab} = \pi_b$. As
$\pi_b P(v \mid v_{-1} = b) = P(bv)$, we get $zR(z)$. Furthermore, $w \cdot \mathcal{M} - w$ translates into
$P(w) z^m \cdot (M_w(z) - 1)$. By the just proved (7.2.15), this is $P(w) z^m \cdot U_w(z)(z - 1)$,
and after a simplification, we obtain (7.2.16).

Finally, we deal with (7.2.14), and prove it using (7.2.10) from Theorem 7.2.5.
The left-hand side of (7.2.10) involves language $\mathcal{M}$, hence we must condition
on the left occurrence of w. In particular, $\bigcup_{r \ge 1} \mathcal{M}^r + \varepsilon$ (that is, $\mathcal{M}^{*}$) of (7.2.10) translates
into $\frac{1}{1 - M_w(z)}$. Now we deal with $\mathcal{A}^{*} \cdot w$ of the right-hand side of (7.2.10).
Conditioning on the left occurrence of w, the generating function $A_w(z)$ of
$\mathcal{A}^{*} \cdot w$ is

$$A_w(z) = \sum_{n \ge 0} \sum_{|u| = n} z^{n+m} P(uw \mid u_{-1} = w_m)$$

$$= \sum_{n \ge 0} \sum_{|u| = n} z^{n} P(u w_1 \mid u_{-1} = w_m) P(w_2 \ldots w_m \mid w_1) z^{m}.$$

We have $P(w_2 \ldots w_m \mid w_1) z^m = \frac{1}{\pi_{w_1}} z^m P(w)$, and for $n \ge 0$:

$$\sum_{|u| = n} P(u w_1 \mid u_{-1} = w_m) = \left[P^{n+1}\right]_{w_m,w_1},$$

where, we recall, $w_m$ is the last character of w. In summary, the language
$\mathcal{A}^{*} \cdot w$ contributes $P(w) z^m \frac{1}{\pi_{w_1}} \left[\sum_{n \ge 0} P^{n+1} z^n\right]_{w_m,w_1}$, while the language $\mathcal{S} - \{\varepsilon\}$
introduces $S_w(z) - 1$. Using the equality $P^{n+1} - \Pi = (P - \Pi)^{n+1}$ (which follows
from a consecutive application of the identity $\Pi P = \Pi$), and observing that for
any symbols a and b

$$\frac{1}{\pi_b} \left[\sum_{n \ge 0} \Pi z^n\right]_{ab} = \sum_{n \ge 0} z^n = \frac{1}{1-z},$$

we finally obtain the sum in (7.2.14). This completes the proof of the lemma.


The lemma above together with Theorem 7.2.3 suffice to derive generating
functions $N_r(z)$ and $N(z,u)$ in an explicit form.


THEOREM 7.2.7. Let w be a given pattern of size m, and X be a random text
of length n generated according to an ergodic and aperiodic Markov chain with
the transition probability matrix P. Define

$$D_w(z) = (1 - z)S_w(z) + z^m P(w)(1 + (1 - z)F(z)). \tag{7.2.17}$$

Then

$$N_0(z) = \frac{1 - R(z)}{1 - z} = \frac{S_w(z)}{D_w(z)}, \tag{7.2.18}$$
$$N_r(z) = R(z) M_w^{r-1}(z) U_w(z), \quad r \ge 1, \tag{7.2.19}$$
$$N(z,u) = R(z) \frac{u}{1 - uM_w(z)} U_w(z), \tag{7.2.20}$$

where

$$M_w(z) = 1 + \frac{z - 1}{D_w(z)}, \tag{7.2.21}$$
$$U_w(z) = \frac{1}{D_w(z)}, \tag{7.2.22}$$
$$R(z) = z^m P(w) \frac{1}{D_w(z)}. \tag{7.2.23}$$

We recall that for memoryless sources, F(z) = 0, and hence

$$D(z) = (1 - z)S(z) + z^m P(w). \tag{7.2.24}$$

Proof. We only comment on the derivation of $N_0(z)$ since the rest follows directly
from our previous results. Observe that

$$N_0(z) = \sum_{n \ge 0} P(N_n = 0) z^n = \sum_{n \ge 0} (1 - P(N_n > 0)) z^n = \frac{1}{1 - z} - \sum_{r=1}^{\infty} N_r(z);$$

thus the first expression follows from (7.2.19). The second expression is a direct
translation of $\mathcal{T}_0 \cdot w = \mathcal{R} \cdot \mathcal{S}$ (cf. (7.2.9)) which reads $N_0(z) P(w) z^m = R(z) S_w(z)$
in terms of the appropriate generating functions.
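
For a memoryless source the second expression for $N_0(z)$, namely S(z)/D(z) with D(z) as in (7.2.24), can be tested against brute-force enumeration. The following sketch does so for w = aba and the unbiased binary source; the helper names are assumptions, and the series division is a plain truncated power-series quotient.

```python
from fractions import Fraction
from itertools import product


def series_div(num, den, terms):
    """First `terms` power-series coefficients of num(z)/den(z); den[0] != 0."""
    out = []
    for n in range(terms):
        acc = num[n] if n < len(num) else Fraction(0)
        for k in range(1, min(n, len(den) - 1) + 1):
            acc -= den[k] * out[n - k]
        out.append(acc / den[0])
    return out


def no_occurrence_prob(w, alphabet, n):
    """P(N_n(w) = 0) for the unbiased memoryless source, by enumeration."""
    hits = sum(1 for t in product(alphabet, repeat=n) if w not in "".join(t))
    return Fraction(hits, len(alphabet) ** n)


# w = aba over {a, b}, unbiased memoryless source, so F(z) = 0:
# S(z) = 1 + z^2/4 and D(z) = (1 - z) S(z) + z^3/8 = 1 - z + z^2/4 - z^3/8.
S = [Fraction(1), Fraction(0), Fraction(1, 4)]
D = [Fraction(1), Fraction(-1), Fraction(1, 4), Fraction(-1, 8)]
coeffs = series_div(S, D, 10)

for n in range(10):
    assert coeffs[n] == no_occurrence_prob("aba", "ab", n)
print(coeffs[:5])  # 1, 1, 1, 7/8, 3/4 as Fractions
```

The enumeration is exponential in n and serves only as a check; the series coefficients obey a linear recurrence of order m.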


7.2.3. Moments and limit laws 


In the previous section we derived explicit formulas for the generating functions
$N(z,u) = \sum_{n \ge 0} E(u^{N_n}) z^n$ and $N_r(z)$. These formulas can be used to obtain
explicit and asymptotic expressions for moments of $N_n$ (cf. Theorem 7.2.8),
the central limit theorem (cf. Theorem 7.2.11), and large deviations (cf. Theorem
7.2.12). We start with the derivation of the mean and the variance of $N_n$.


THEOREM 7.2.8. Under the assumptions of Theorem 7.2.7 and $nP(w) \to \infty$,
one has, for $n \ge m$:

$$E[N_n(w)] = P(w)(n - m + 1), \tag{7.2.25}$$
and
$$\mathrm{Var}[N_n(w)] = n c_1 + c_2 + O(R^{-n}), \quad \text{for } R > 1, \tag{7.2.26}$$
where
$$c_1 = P(w)\left(2S_w(1) - 1 - (2m - 1)P(w) + 2P(w)E_1\right), \tag{7.2.27}$$
$$c_2 = P(w)\left((m - 1)(3m - 1)P(w) - (m - 1)(2S_w(1) - 1) - 2S_w'(1)\right) - 2(2m - 1)P(w)^2 E_1 + 2E_2 P(w)^2, \tag{7.2.28}$$
and the constants $E_1$, $E_2$ are
$$E_1 = \frac{1}{\pi_{w_1}} \left[(P - \Pi)Z\right]_{w_m,w_1}, \qquad E_2 = \frac{1}{\pi_{w_1}} \left[(P^2 - \Pi)Z^2\right]_{w_m,w_1}.$$

Proof. Notice that the first moment estimate can be derived directly from the
definition of the stationary probability of w. In order to grasp higher moments
we will use analytic tools applied to generating functions. We compute the first
two moments of $N_n$ from N(z,u) since $E(N_n) = [z^n]N_u(z,1)$ and
$E(N_n(N_n - 1)) = [z^n]N_{uu}(z,1)$ where $N_u(z,1)$ and $N_{uu}(z,1)$ are the first and the second
derivatives of N(z,u) with respect to the variable u at (z,1). By Theorem 7.2.7 we
find

$$N_u(z,1) = \frac{z^m P(w)}{(1 - z)^2},$$
$$N_{uu}(z,1) = \frac{2 z^m P(w) M_w(z) D_w(z)}{(1 - z)^3}.$$

Now we observe that both expressions admit as a numerator a function that is
analytic beyond the unit circle. Furthermore, for a positive integer k > 0

$$[z^n](1 - z)^{-k} = \binom{n + k - 1}{k - 1} = \frac{\Gamma(n + k)}{\Gamma(k)\Gamma(n + 1)} \tag{7.2.29}$$

(where $\Gamma(x)$ is the Euler gamma function), and we find for $n \ge m$

$$E(N_n) = [z^n]N_u(z,1) = P(w)[z^{n-m}](1 - z)^{-2} = (n - m + 1)P(w).$$

In order to estimate the variance, we introduce

$$\Phi(z) = 2 z^m P(w) M_w(z) D_w(z),$$

and observe that

$$\Phi(z) = \Phi(1) + (z - 1)\Phi'(1) + \frac{(z - 1)^2}{2}\Phi''(1) + (z - 1)^3 f(z),$$

where f(z) is the remainder of the Taylor expansion of $\Phi(z)$ up to order 3 at
z = 1. For memoryless sources, $\Phi(z)$ and thus f(z) are polynomials of degree
2m - 2 and $[z^n](z - 1)f(z)$ is 0 for $n \ge 2m - 1$. Hence, by (7.2.29) we arrive at

$$E(N_n(N_n - 1)) = [z^n]N_{uu}(z,1) = \Phi(1)\frac{(n + 2)(n + 1)}{2} - \Phi'(1)(n + 1) + \frac{1}{2}\Phi''(1).$$

But $M_w(z)D_w(z) = D_w(z) - (1 - z)$ by (7.2.21), and taking into account formula (7.2.24)
for D(z), we finally obtain (7.2.26).

For Markov sources, $D_w(z)$ has an additional term, which contributes

$$[z^n]\frac{2 z^{2m} P(w)^2 F(z)}{(1 - z)^2},$$

where F(z), defined in (7.2.13), is analytic beyond the unit circle for $|z| \le R$,
with R > 1. The Taylor expansion of F(z) is $E_1 + (1 - z)E_2$, and applying
(7.2.29) again yields the result.

Recall that $P = \Pi$ for memoryless sources, so $E_1 = E_2 = 0$ and (7.2.26)
reduces to an equality for $n \ge 2m - 1$. Thus

$$\mathrm{Var}[N_n(w)] = n c_1 + c_2 \tag{7.2.30}$$

with

$$c_1 = P(w)\left(2S(1) - 1 - (2m - 1)P(w)\right),$$
$$c_2 = P(w)\left((m - 1)(3m - 1)P(w) - (m - 1)(2S(1) - 1) - 2S'(1)\right).$$
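
The memoryless formulas for the mean and the variance lend themselves to an exact check by enumeration. The sketch below assumes an unbiased source over a small alphabet and compares the enumerated moments of $N_n(w)$ with $(n - m + 1)P(w)$ and $n c_1 + c_2$; function names are ours.

```python
from fractions import Fraction
from itertools import product


def exact_moments(w, alphabet, n):
    """Mean and variance of N_n(w) for the unbiased memoryless source,
    obtained by enumerating all texts of length n."""
    m, total = len(w), len(alphabet) ** n
    s1 = s2 = 0
    for t in product(alphabet, repeat=n):
        text = "".join(t)
        c = sum(text.startswith(w, i) for i in range(n - m + 1))
        s1 += c
        s2 += c * c
    mean = Fraction(s1, total)
    return mean, Fraction(s2, total) - mean ** 2


def variance_formula(w, alphabet, n):
    """n*c1 + c2 with c1, c2 as in the memoryless case (used for n >= 2m - 1)."""
    m = len(w)
    q = Fraction(1, len(alphabet))   # probability of each symbol
    Pw = q ** m
    ks = [k for k in range(1, m + 1) if w[:k] == w[m - k:]]
    S1 = sum(q ** (m - k) for k in ks)               # S(1)
    dS1 = sum((m - k) * q ** (m - k) for k in ks)    # S'(1)
    c1 = Pw * (2 * S1 - 1 - (2 * m - 1) * Pw)
    c2 = Pw * ((m - 1) * (3 * m - 1) * Pw - (m - 1) * (2 * S1 - 1) - 2 * dS1)
    return n * c1 + c2


for n in range(5, 9):
    mean, var = exact_moments("aba", "ab", n)
    assert mean == Fraction(n - 2, 8)                  # (n - m + 1) P(w)
    assert var == variance_formula("aba", "ab", n)
print(variance_formula("aba", "ab", 5))  # 19/64
```

For w = aba one gets $c_1 = 7/64$ and $c_2 = -1/4$; the negative correlation between adjacent positions (aba cannot overlap itself by two symbols) is what pulls the variance below that of independent trials.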


In passing we should notice that from the generating function N(z,u) we
can compute all moments of $N_n$. Instead, however, we present some limit
laws for $P(N_n = r)$ for different values of r: We consider r = O(1),
$r = E(N_n) + x\sqrt{\mathrm{Var}(N_n)}$ (central and local limit regime), and $r = (1 + \delta)E(N_n)$
(large deviations). From the central limit theorem (cf. Theorem 7.2.11 below)
we conclude that the normalized random variable $(N_n - E(N_n))/\sqrt{\mathrm{Var}(N_n)}$
converges also in moments to the moments of the standard normal distribution.
This follows from the fact that in the theorem below we prove the convergence
of the normalized generating function to an analytic function, namely $e^{\tau^2/2}$ for
$\tau$ complex in the vicinity of zero. Since an analytic function has well defined
derivatives, convergence in moments follows. We shall leave a formal proof to
the reader (cf. Exercise 7.2.3).


THEOREM 7.2.9. Under the assumptions of Theorem 7.2.8, let $\rho_w$ be the root
of $D_w(z) = 0$ of the smallest modulus and multiplicity one. Then, $\rho_w$ is real
such that $\rho_w > 1$, and there exists $\rho > \rho_w$ such that for r = O(1)

$$P(N_n(w) = r) = \sum_{j=1}^{r+1} (-1)^j a_j \binom{n}{j - 1} \rho_w^{-(n+j)} + O(\rho^{-n}), \tag{7.2.31}$$

where

$$a_{r+1} = \frac{\rho_w^m P(w) (\rho_w - 1)^{r-1}}{\left(D_w'(\rho_w)\right)^{r+1}} \tag{7.2.32}$$

and the remaining coefficients can be computed according to

$$a_j = \frac{1}{(r + 1 - j)!} \lim_{z \to \rho_w} \frac{d^{r+1-j}}{dz^{r+1-j}} \left(N_r(z)(z - \rho_w)^{r+1}\right) \tag{7.2.33}$$

with j = 1, 2, ..., r.
In order to prove Theorem 7.2.9, we need the following simple result. 


LEMMA 7.2.10. The equation $D_w(z) = 0$ has at least one root, and all its roots
are of modulus greater than 1.

Proof. Poles of $D_w(z) = (1 - z)/(1 - M_w(z))$ are clearly poles of $1/(1 - M_w(z))$. As
$1/(1 - M_w(z))$ is the generating function of a language, it converges for |z| < 1 and has
no pole of modulus smaller than 1. Since $D_w(1) \neq 0$, then z = 1 is a simple pole
of $1/(1 - M_w(z))$. As all its coefficients are real and nonnegative, there is no
other pole of modulus |z| = 1. It follows that all roots of $D_w(z)$ are of modulus
greater than 1. The existence of a root is guaranteed since $D_w(z)$ is either a
polynomial (Bernoulli model) or a ratio of polynomials (Markov model).


Proof of Theorem 7.2.9. We first re-write the formula on $N_r(z)$ as follows

$$N_r(z) = \frac{z^m P(w) \left(D_w(z) + z - 1\right)^{r-1}}{D_w^{r+1}(z)}. \tag{7.2.34}$$

Observe that $P(N_n(w) = r)$ is the coefficient at $z^n$ of $N_r(z)$. By Hadamard's
theorem, asymptotics of the coefficients of a generating function depend on the
singularities of the underlying generating function. In our case, the generating
function $N_r(z)$ is a rational function, thus we can only expect poles (for which
the denominator $D_w(z)$ vanishes). Lemma 7.2.10 above establishes the existence
and properties of such a pole. Therefore, the generating function $N_r(z)$ can be
expanded around its root of smallest modulus, say $\rho_w$,
in a Laurent series:

$$N_r(z) = \sum_{j=1}^{r+1} \frac{a_j}{(z - \rho_w)^j} + \tilde{N}_r(z), \tag{7.2.35}$$

where $\tilde{N}_r(z)$ is analytic in $|z| < \rho'$ and $\rho'$ is defined as
$\rho' = \inf\{|\rho| : |\rho| > \rho_w \text{ and } D_w(\rho) = 0\}$. The constants $a_j$ satisfy (7.2.33). This formula simplifies
into (7.2.32) for the leading constant $a_{r+1}$. As a consequence of analyticity
we have for $1 < \rho_w < \rho < \rho'$: $[z^n]\tilde{N}_r(z) = O(\rho^{-n})$. Hence, the term $\tilde{N}_r(z)$
contributes only to the lower terms in the asymptotic expansion of $N_r(z)$. After
some algebra, and noting that $[z^n] 1/(1 - z)^{k+1} = \binom{n+k}{n}$, we prove Theorem
7.2.9.
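
The role of the dominant root can be observed numerically in the simplest case r = 0, where $P(N_n = 0) = [z^n] S(z)/D(z)$ for a memoryless source. The sketch below, for w = aba and the unbiased binary source, locates the real root $\rho_w$ of D(z) by bisection and checks that consecutive probabilities decay at rate $1/\rho_w$. All names are assumptions, and the bracket [1, 2] was chosen by inspecting the sign of D at its endpoints.

```python
from fractions import Fraction


def series_div(num, den, terms):
    """First `terms` power-series coefficients of num(z)/den(z); den[0] != 0."""
    out = []
    for n in range(terms):
        acc = num[n] if n < len(num) else Fraction(0)
        for k in range(1, min(n, len(den) - 1) + 1):
            acc -= den[k] * out[n - k]
        out.append(acc / den[0])
    return out


def bisect_root(poly, lo, hi, steps=200):
    """Bisection for a sign change of poly (coefficients low to high) on [lo, hi]."""
    def value(z):
        return sum(float(c) * z ** k for k, c in enumerate(poly))
    assert value(lo) * value(hi) < 0, "no sign change on the bracket"
    for _ in range(steps):
        mid = (lo + hi) / 2
        if value(lo) * value(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# w = aba, unbiased binary memoryless source: D(z) = 1 - z + z^2/4 - z^3/8
S = [Fraction(1), Fraction(0), Fraction(1, 4)]
D = [Fraction(1), Fraction(-1), Fraction(1, 4), Fraction(-1, 8)]

rho = bisect_root(D, 1.0, 2.0)   # D(1) > 0 > D(2)
p = series_div(S, D, 60)         # p[n] = P(N_n = 0)
ratio = float(p[41] / p[40])

print(rho)               # about 1.14, larger than 1 as Lemma 7.2.10 asserts
print(ratio, 1 / rho)    # the two values agree to many digits
```

The agreement is fast because the other two roots of this cubic are complex with modulus well above $\rho_w$, so their contribution is exponentially smaller.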


In the next theorem we establish the central limit theorem in its strong form 
(i.e., local limit theorem). 


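THEOREM 7.2.11. Under the same assumption as in Theorem 7.2.8 we have

$$P\left(N_n(w) \le E(N_n) + x\sqrt{\mathrm{Var}(N_n)}\right) = \left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right) \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt. \tag{7.2.36}$$

If, in addition, $p_{ij} > 0$ for all $i, j \in \mathcal{A}$, then for any bounded real interval B

$$\sup_{x \in B} \left| P\left(N_n(w) = \lfloor E(N_n) + x\sqrt{\mathrm{Var}(N_n)} \rfloor\right) - \frac{1}{\sqrt{2\pi \mathrm{Var}(N_n)}} e^{-x^2/2} \right| = o\left(\frac{1}{\sqrt{n}}\right) \tag{7.2.37}$$

as $n \to \infty$.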

Proof. Let $r = \lfloor E(N_n) + x\sqrt{\mathrm{Var}(N_n)} \rfloor$ with x = O(1). We compute $P(N_n(w) \le r)$
(central limit theorem) and $P(N_n(w) = r)$ (local limit theorem) for
$r = E(N_n) + x\sqrt{\mathrm{Var}(N_n)}$ when x = O(1). Let $\nu_n = E(N_n(w)) = (n - m + 1)P(w)$
and $\sigma_n^2 = \mathrm{Var}(N_n(w)) = c_1 n + O(1)$. To establish normality of $(N_n(w) - \nu_n)/\sigma_n$,
it suffices, according to Lévy's continuity theorem, to prove the following

$$\lim_{n \to \infty} e^{-\tau \nu_n/\sigma_n} N_n(e^{\tau/\sigma_n}) = e^{\tau^2/2} \tag{7.2.38}$$

for complex $\tau$ (actually, $\tau = iv$ suffices). Again, by Cauchy's theorem

$$N_n(u) = \frac{1}{2\pi i} \oint \frac{N(z,u)}{z^{n+1}}\,dz = \frac{1}{2\pi i} \oint \frac{u P(w)}{D_w^2(z)\left(1 - uM_w(z)\right) z^{n+1-m}}\,dz,$$

where the integration is along a circle around the origin. The evaluation of this
integral is standard and it appeals to the Cauchy residue theorem. Namely, we
enlarge the circle of integration to a bigger one, say of radius R > 1, such that the bigger
circle contains the dominant pole of the integrand function. Observe that the
Cauchy integral over the bigger circle is $O(R^{-n})$. Let us now substitute $u = e^t$
and $z = e^{\rho}$. Then, the poles of the integrand are the roots of the equation

$$1 - e^t M_w(e^{\rho}) = 0. \tag{7.2.39}$$

This equation implicitly defines in some neighborhood of t = 0 a unique $C^{\infty}$
function $\rho(t)$, satisfying $\rho(0) = 0$. Notably, all other roots $\rho$ satisfy $\inf |\rho| = \rho' > 0$.
Then, the residue theorem with $e^{\rho'} > R > e^{\rho} > 1$ leads to

$$N_n(e^t) = C(t) e^{-(n+1-m)\rho(t)} + O(R^{-n}), \tag{7.2.40}$$

where

$$C(t) = \frac{P(w)}{D_w^2(\rho(t)) M_w'(\rho(t))}.$$

To study properties of $\rho(t)$, we observe that the cumulant formula implies
$E(N_n(w)) = [t]\log N_n(e^t)$ and $\sigma_n^2 = [t^2]\log N_n(e^t)$ where, we recall, $[t^r]f(t)$
denotes the coefficient of f(t) at $t^r$. In our case, $\nu_n \sim -n\rho'(0)$ as well as
$\sigma_n^2 \sim -n\rho''(0)$. Set now in (7.2.40) $t = \tau/\sigma_n \to 0$ for some complex $\tau$. Since
uniformly in t we have $\rho(t) = t\rho'(0) + \rho''(0)t^2/2 + O(t^3)$ for $t \to 0$, our estimate
(7.2.40) leads to

$$e^{-\tau \nu_n/\sigma_n} N_n(e^{\tau/\sigma_n}) = \exp\left(\frac{\tau^2}{2} + O(n\tau^3/\sigma_n^3)\right) = e^{\tau^2/2}\left(1 + O(1/\sqrt{n})\right),$$

which proves (7.2.36) after applying the Berry-Esseen inequality that allows one to
derive the error term $O(1/\sqrt{n})$ for the probability distribution.

To establish the local limit theorem, we observe that if $p_{ij} > 0$ for all $i, j \in \mathcal{A}$,
then $\rho(t) > 0$ for $t \neq 0$ (cf. Exercise 7.2.4). We can obtain a much more refined
local limit result. Indeed, we find for $x = o(n^{1/6})$

$$P\left(N_n = E(N_n) + x\sqrt{n c_1}\right) = \frac{1}{\sqrt{2\pi n c_1}} e^{-x^2/2} \left(1 - \frac{\kappa_3}{2 c_1^{3/2} \sqrt{n}} \left(x - \frac{x^3}{3}\right)\right) + O(n^{-3/2}), \tag{7.2.41}$$

where $\kappa_3$ is a constant (i.e., the third cumulant). This completes the proof of
Theorem 7.2.11.


Finally, we establish precise large deviations for $N_n$. Large deviations play
a central role in many applications, most notably in data mining and molecular
biology, since they allow one to establish a threshold for overrepresented and
underrepresented patterns.


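THEOREM 7.2.12. Let $r = aE[N_n]$ with $a = (1 + \delta)P(w)$ for $\delta \neq 0$. For
complex t, define $\rho(t)$ to be the root of

$$1 - e^t M_w(e^{\rho}) = 0, \tag{7.2.42}$$

and define $\omega_a$ and $\sigma_a$ by

$$-\rho'(\omega_a) = a, \qquad -\rho''(\omega_a) = \sigma_a^2.$$

Then

$$P\left(N_n(w) = (1 + \delta)E(N_n)\right) \sim \frac{1}{\sigma_a \sqrt{2\pi(n - m + 1)}} e^{-(n-m+1)I(a) + \theta_a}, \tag{7.2.43}$$

where $I(a) = a\omega_a + \rho(\omega_a)$ and

$$\theta_a = \log \frac{P(w) e^{m\rho(\omega_a)}}{D_w(e^{\rho(\omega_a)}) + (1 - e^{\rho(\omega_a)}) D_w'(e^{\rho(\omega_a)})}, \tag{7.2.44}$$

and $D_w(z)$ is defined in (7.2.17).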
Proof. From (7.2.40) we conclude that

$$\lim_{n \to \infty} \frac{\log N_n(e^t)}{n} = -\rho(t).$$

By the Gärtner-Ellis theorem we find

$$\lim_{n \to \infty} \frac{\log P(N_n > na)}{n} = -I(a),$$

where

$$I(a) = a\omega_a + \rho(\omega_a)$$

with $\omega_a$ being a solution of $-\rho'(t) = a$. A stronger version of the above result
is possible and we derive it in the sequel. In fact, we use (7.2.41) and the "shift
of mean" technique.
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As in the local limit regime, we could use Cauchy’s formula to compute the 
probability P(N, =r) for r = E(N,,) + cO(/n). But, formula (7.2.41) is only 
good for « = O(1) while we need x = O(./n) for the large deviations. To 
expand its validity, we shift the mean of the generating function N,,(u) to a new 
value, say m = an = (1+6)P(w)(n —m-+ 1), so we can again apply the central 
limit formula (7.2.41) around the new mean. To accomplish this, let us re-write 
(7.2.40) as for any R > 0 


Nn(e) = CH[g(t)]"-"* + O(R™) 


where g(t) = e~?). (In the derivation below, for simplicity we dropped O(R~") 
term.) The above suggests that N,,(e’) is the moment generating function of a 
sum S;, of n —m-+1 “almost” independent random variables X1,...,Xn—m+1 
having moment generating function equal to g(t) and Y whose moment gener- 
ating function is C(t). Observe that E(S,,) = (n — m+ 1)P(w) while we need 
to estimate the tail of S,, around (1+ 6)(n —m-+1)P(w). To achieve it, we 
introduce a new random variable X; whose moment generating function g(t) is 


- g(t + w) 
g(w) 


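where $\omega$ will be chosen later. Then the mean and the variance of the new
variable $\tilde X$ are

$$E(\tilde X) = \frac{g'(\omega)}{g(\omega)} = -\rho'(\omega), \qquad
\mathrm{Var}(\tilde X) = \frac{g''(\omega)}{g(\omega)} - \left(\frac{g'(\omega)}{g(\omega)}\right)^2 = -\rho''(\omega).$$

We choose $\omega = \omega_a$ such that

$$-\rho'(\omega_a) = \frac{g'(\omega_a)}{g(\omega_a)} = a = P(w)(1+\delta).$$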


Then the new sum $\tilde S_n - Y = \tilde X_1 + \cdots + \tilde X_{n-m+1}$ has a new mean
$(1+\delta)P(w)(n-m+1) = a(n-m+1)$, and hence we can apply to $\tilde S_n - Y$ the central limit result
(7.2.41). To translate from $\tilde S_n - Y$ to $S_n$ we use the following simple formula:

$$[e^{tM}]\left(g^n(t)\right) = \frac{g^n(\omega)}{e^{\omega M}}\,[e^{tM}]\left(\frac{g^n(t+\omega)}{g^n(\omega)}\right), \tag{7.2.45}$$

where $M = a(n-m+1)$ and $[e^{tn}]g(t)$ denotes the coefficient of $g(t)$ at $z^n = e^{tn}$
(where $z = e^t$). Now we can apply (7.2.41) to the right-hand side of the above
to obtain


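$$[e^{tM}]\left(g^{n-m+1}(t)\right) \sim \frac{g^{n-m+1}(\omega_a)\,e^{-\omega_a M}}{\sigma_a\sqrt{2\pi(n-m+1)}} = \frac{e^{-(n-m+1)I(a)}}{\sigma_a\sqrt{2\pi(n-m+1)}}.$$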


To obtain the final result we must take into account the effect of $Y$, whose
moment generating function is $C(t)$. This leads to replacing $a = 1+\delta$ by
$a = 1+\delta+C'(0)/n$, resulting in the correction term $e^{\delta_a} = e^{C'(0)\omega_a}$. Theorem 7.2.12
is proved.


Table 7.1. Z score vs p-value of tandem repeats in A.thaliana.

Oligomer     Obs.   p-val (large dev.)    Z-sc.
AATTGGCGG    2      …                     …
TITCTACCA    3      4.350 × 10^(…)        22.96
ACGGTTCAC    3      2.265 × 10^(−…)       55.49
ACGACGCTT    4      1.604 × 10^(…)        74.01
ACGCTTGE     4      5.374 × 10^(…)        81.93
GAGAAGACG    5      0.687 × 10^(…)        151.10


We illustrate the above results on an example taken from molecular biology. 


EXAMPLE 7.2.13. Biologists apply the so-called Z-score and p-value to determine
whether biological sequences such as DNA or protein contain a biological
signal, that is, an underrepresented or overrepresented pattern. These quantities
are defined as

$$Z(w) = \frac{N_n(w) - E(N_n(w))}{\sqrt{\mathrm{Var}(N_n(w))}}, \qquad
\mathrm{pval}(r) = P(N_n(w) > r).$$


Z-score indicates by how many standard deviations the observed value N,,(w) 
is away from the mean. Clearly, this score makes sense only if one can prove, as 
we did in Theorem 7.2.11, that Z satisfies (at least asymptotically) the Central 
Limit Theorem (CLT). On the other hand, p-value is used for rare occurrences, 
far away from the mean where one needs to apply the large deviations as in 
Theorem 7.2.12. 

The ranges of validity of the Z-score and the p-value are important, as illustrated
in Table 7.1, where results for 2008-nucleotide-long fragments of A.thaliana
(a plant genome) are presented. In the table, for each 9-mer the number of
observations is presented in the first column, followed by the large deviations
probability computed from Theorem 7.2.12 and the Z-score. We observe that for
AATTGGCGG and AAGACGGTT the Z-scores are about 48 while the p-values
differ by two orders of magnitude. In fact, occurrences of these 9-mers are very
rare, and therefore the Z-score is not an adequate measure.
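
For short texts the two quantities of Example 7.2.13 can be computed exactly, which gives a convenient sanity check on the asymptotic formulas. The Python sketch below is an illustration only. It assumes a memoryless source given as a dictionary of symbol probabilities, takes the Z-score as (observed value minus mean) divided by the standard deviation, and uses hypothetical helper names (`occurrence_distribution`, `z_score_and_p_value`) that do not come from the text.

```python
from math import sqrt

def occurrence_distribution(w, probs, n):
    """Exact law of the number of (overlapping) occurrences of w in a random
    text of length n emitted by a memoryless source with symbol law `probs`."""
    m = len(w)

    def step(state, c):
        # state = length of the longest prefix of w that is a suffix of the text
        s = w[:state] + c
        for k in range(min(m, len(s)), -1, -1):
            if s.endswith(w[:k]):
                return k

    layer = {(0, 0): 1.0}                     # (state, count) -> probability
    for _ in range(n):
        nxt = {}
        for (state, count), p in layer.items():
            for c, pc in probs.items():
                t = step(state, c)
                key = (t, count + (t == m))   # a full match ends at this symbol
                nxt[key] = nxt.get(key, 0.0) + p * pc
        layer = nxt

    dist = {}
    for (_, count), p in layer.items():
        dist[count] = dist.get(count, 0.0) + p
    return dist

def z_score_and_p_value(w, probs, n, observed):
    """Z-score of `observed` and pval(observed) = P(N_n(w) > observed)."""
    dist = occurrence_distribution(w, probs, n)
    mean = sum(r * p for r, p in dist.items())
    var = sum((r - mean) ** 2 * p for r, p in dist.items())
    z = (observed - mean) / sqrt(var)
    p_val = sum(p for r, p in dist.items() if r > observed)
    return z, p_val
```

The dynamic program needs time proportional to $n$ times the number of reachable (state, count) pairs, so it is practical only for moderate $n$; for long genomic texts the large deviations estimate of Theorem 7.2.12 is the tool of choice.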


7.2.4. Waiting times 


We shall now discuss the waiting times $T_w$ and $T_j$, where $T_w := T_1$ is the first
time $w$ occurs in the text, while $T_j$ is the minimum length of the text in which
$w$ occurs $j$ times. Fortunately, we do not need to re-derive the generating function of
$T_j$ since, as we have already indicated in (7.2.3), the following duality principle
holds:

$$\{N_n < j\} = \{T_j > n\},$$

and in particular, $\{T_w > n\} = \{N_n = 0\}$. Therefore, if

$$T(u,z) = \sum_{n\ge 0}\sum_{j\ge 0} P(T_j = n)\,z^n u^j,$$

then by the duality principle we have

$$(1-u)T(u,z) + u(1-z)N(z,u) = 1,$$


and one obtains T(u, z) from Theorem 7.2.7. Waiting times were analyzed in 
depth in Chapter 6. 
Finally, observe that the above duality principle implies

$$E(T_w) = \sum_{n\ge 0} P(N_n = 0) = N_0(1).$$

In particular, for memoryless sources, from Theorem 7.2.7 we conclude that

$$N_0(z) = \frac{S(z)}{(1-z)S(z) + z^m P(w)}.$$

Hence

$$E(T_w) = \sum_{n\ge 0} P(N_n(w) = 0) = N_0(1) = \frac{S(1)}{P(w)}
= \sum_{k\in\mathcal{P}(w)} \frac{1}{P(w_1^k)}. \tag{7.2.46}$$
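
The closed form (7.2.46) is easy to evaluate directly. The following Python sketch is only an illustration under stated assumptions: the source is memoryless and given as a dictionary of symbol probabilities, and the function name `expected_waiting_time` is hypothetical. It sums $1/P(w_1^k)$ over the positions $k$ at which the prefix of length $k$ of $w$ equals its suffix of length $k$.

```python
from math import prod

def expected_waiting_time(w, probs):
    """E(T_w) for a memoryless source, as the sum over k in P(w) of 1/P(w_1^k)."""
    m = len(w)
    total = 0.0
    for k in range(1, m + 1):
        if w[:k] == w[m - k:]:                 # k belongs to the autocorrelation set P(w)
            total += 1.0 / prod(probs[c] for c in w[:k])
    return total

# Fair coin: the self-overlapping word HH takes longer to appear than HT.
fair = {"H": 0.5, "T": 0.5}
hh = expected_waiting_time("HH", fair)   # 1/0.5 + 1/0.25
ht = expected_waiting_time("HT", fair)   # 1/0.25 only
```

The example shows the effect of the autocorrelation: both words have probability 1/4, yet the overlap of HH with itself adds the term for $k = 1$.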


7.3. Generalized string matching 


In this section we consider generalized pattern matching in which a set of patterns
(rather than a single pattern) is given. We assume that the pattern is a
pair of sets of words $(W_0, W)$ where $W = \bigcup_{i=1}^d W_i$ consists of sets $W_i \subset A^{m_i}$
(i.e., all words in $W_i$ have a fixed length equal to $m_i$). The set $W_0$ is called
the forbidden set. For $W_0 = \emptyset$ one is interested in the number of pattern occurrences,
$N_n(W)$, defined as the number of patterns from $W$ occurring in the
text $X_1^n$ generated by a (random) source. Another parameter of interest may be
the number of positions in $X_1^n$ where a pattern from $W$ appears (clearly, some
patterns may occur more than once at some positions). The latter quantity we
denote as $\Pi_n$. If we define $\Pi_n^{(i)}$ as the number of positions where a word from
$W_i$ occurs, then

$$N_n(W) = \Pi_n^{(1)} + \cdots + \Pi_n^{(d)}.$$

Notice that at any given position of the text and for a given $i$ only one word
from $W_i$ can occur.

For $W_0 \neq \emptyset$ one studies the number of occurrences $N_n(W)$ under the condition
that $N_n(W_0) := \Pi_n^{(0)} = 0$, that is, there is no occurrence of a pattern from
$W_0$ in the text $X_1^n$. This could be called restricted pattern matching since one
restricts the text to those strings that do not contain strings from $W_0$.

Finally, we may set $W_i = \emptyset$ for $i = 1,\ldots,d$ with $W_0 \neq \emptyset$ and count the
number of text strings that do not contain any pattern from $W_0$. (Alternatively,
we can estimate the probability that a randomly selected text $X_1^n$ does not
contain any pattern from $W_0$.) In particular, define for $\ell \le k$

$$W_0 = \{\underbrace{0\ldots 0}_{\ell},\ldots,\underbrace{0\ldots 0}_{k}\}, \tag{7.3.1}$$

that is, $W_0$ consists of runs of zeros of length at least $\ell$ and at most $k$. A text
satisfying the property that no pattern from $W_0$ defined in (7.3.1) occurs in it
is called an $(\ell,k)$ sequence. Such sequences are used for magnetic coding.

In this section, we first present an analysis of generalized pattern matching
with $W_0 = \emptyset$ and $d = 1$, which we call the reduced pattern set (i.e., no pattern
is a substring of another pattern), followed by a detailed analysis of generalized
pattern matching. We describe two methods of analysis. First, we
generalize our language approach from the previous section, and then for the
general pattern matching case we apply the de Bruijn automaton and spectral
analysis of matrices. Finally, we enumerate $(\ell,k)$ sequences and compute the
so-called Shannon capacity for such sequences.

Throughout this section we assume that the text is generated by a (non-de- 
generate) memoryless source (B), as defined in Section 7.1. 


7.3.1. String matching over reduced set of patterns 


We analyze here a special case of generalized pattern matching with $W_0 = \emptyset$
and $d = 1$. In this case we shall write $W_1 := W = \{w_1,\ldots,w_K\}$ where $w_i$
($1 \le i \le K$) are given patterns with fixed length $|w_i| = m$. We shall generalize
the results from the exact pattern matching section, but we omit most of the
proofs or move them to exercises.

As before, let $\mathcal{T}_{\ge 1}$ be the language of words containing at least one occurrence
from the set $W$, and for any nonnegative integer $r$, let $\mathcal{T}_r$ be the language of
words containing exactly $r$ occurrences from $W$. In order to characterize $\mathcal{T}_r$ we
introduce some additional languages for any $1 \le i,j \le K$:


• $\mathcal{M}_{ij} = \{v : w_i v \in \mathcal{T}_2$ and $w_j$ occurs at the right end of $v\}$;


• $\mathcal{R}_i$, defined as the set of words containing only one occurrence of $w_i$, located
at the right end;


• $\mathcal{U}_i = \{u : w_i u \in \mathcal{T}_1\}$, that is, the set of words $u$ such that the only occurrence
of $w_i \in W$ in $w_i u$ is on the left.

We also need to generalize the autocorrelation set and the autocorrelation
polynomial to a set of patterns. For any two given strings $w$ and $u$, let

$$\mathcal{S}_{w,u} = \{u_{k+1}^m : w_{m-k+1}^m = u_1^k\}$$

be the correlation set. The set of positions $k$ satisfying $u_1^k = w_{m-k+1}^m$ is denoted
as $\mathcal{P}(w,u)$. If $w = x\cdot v$ and $u = v\cdot y$ for some words $x$, $y$, $v$, then $y \in \mathcal{S}_{w,u}$ and
$|v| \in \mathcal{P}(w,u)$. The correlation polynomial $S_{w,u}(z)$ of $w$ and $u$ is the associated
generating function of $\mathcal{S}_{w,u}$, that is,

$$S_{w,u}(z) = \sum_{k\in\mathcal{P}(w,u)} P(u_{k+1}^m)\,z^{m-k}.$$

In particular, for $w_i, w_j \in W$ we define $S_{ij} := S_{w_i,w_j}$. The correlation matrix
of $W$ is denoted as $S(z) = \{S_{w_i w_j}(z)\}_{i,j=1,K}$.


EXAMPLE 7.3.1. Consider a DNA sequence over the alphabet A = {A,C, 
G, T} generated by a memoryless source with P(A) = #, P(C) = 4, P(G) =% 
and P(T) = 4. Let w; = ATT and we = TAT. Then the correlation matrix 


$S(z)$ is

$$S(z) = \begin{pmatrix} 1 & P(A)P(T)\,z^2 \\ P(T)\,z & 1 + P(A)P(T)\,z^2 \end{pmatrix}.$$
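
The entries of a correlation matrix follow mechanically from the definition of the correlation polynomial: for each overlap length $k$ such that the suffix of $w$ of length $k$ equals the prefix of $u$ of length $k$, one adds the term $P(u_{k+1}^m)z^{m-k}$. The Python sketch below is an illustration only. The function name `correlation_polynomial` is hypothetical, polynomials are represented as dictionaries mapping exponents to coefficients, and the symbol probabilities in the usage lines are assumed values chosen for the demonstration, not values taken from the example above.

```python
from math import prod

def correlation_polynomial(w, u, probs):
    """Coefficients {m-k: P(u_{k+1}^m)} of S_{w,u}(z) for words of equal length m."""
    m = len(w)
    assert len(u) == m, "the reduced-set analysis assumes equal lengths"
    poly = {}
    for k in range(1, m + 1):
        if w[m - k:] == u[:k]:                        # k belongs to P(w, u)
            poly[m - k] = prod(probs[c] for c in u[k:])
    return poly

# Assumed (illustrative) nucleotide probabilities.
probs = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
W = ["ATT", "TAT"]
S = [[correlation_polynomial(wi, wj, probs) for wj in W] for wi in W]
```

With these assumed probabilities the off-diagonal entries are the monomials $P(A)P(T)z^2$ and $P(T)z$, and only the second diagonal entry picks up an extra term, because TAT overlaps itself in one letter while ATT does not.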


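In order to analyze the number of occurrences $N_n(W)$ and its generating
functions we first generalize the language relationships discussed in
Theorem 7.2.3. Observe that

$$\mathcal{T}_r = \sum_{1\le i,j\le K} \mathcal{R}_i\,\mathcal{M}_{ij}^{r-1}\,\mathcal{U}_j,$$

$$\mathcal{T}_{\ge 1} = \sum_{r\ge 1}\ \sum_{1\le i,j\le K} \mathcal{R}_i\,\mathcal{M}_{ij}^{r-1}\,\mathcal{U}_j,$$

where $\sum$ denotes disjoint union of sets. As in Theorem 7.2.5, one finds the
following relationships between the languages just introduced:

$$\bigcup_{k\ge 1} \mathcal{M}_{ij}^k = A^*\cdot w_j + \mathcal{S}_{ij} - \delta_{ij}\varepsilon, \qquad 1\le i,j\le K,$$

$$\mathcal{U}_i\cdot A = \bigcup_j \mathcal{M}_{ij} + \mathcal{U}_i - \varepsilon, \qquad 1\le i\le K,$$

$$A\cdot\mathcal{R}_j - (\mathcal{R}_j - w_j) = \bigcup_i w_i\mathcal{M}_{ij}, \qquad 1\le j\le K,$$

$$\mathcal{T}_0\cdot w_j = \mathcal{R}_j + \mathcal{R}_i(\mathcal{S}_{ij} - \varepsilon), \qquad 1\le i,j\le K,$$

where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise.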


Let us now analyze $N_n(W)$ in a probabilistic framework. To simplify our presentation,
we assume that the text is generated by a memoryless source. Then
the above language relationships translate directly into generating functions, as
discussed in the last section.

Before we proceed, we adopt the following notation. Lower-case letters
are reserved for vectors, which are assumed to be column vectors (e.g., $x^t =
(x_1,\ldots,x_K)$), except for vectors of generating functions, which we denote by
upper-case letters (e.g., $U^t(z) = (U_1(z),\ldots,U_K(z))$ where $U_i(z)$ is the generating
function of the language $\mathcal{U}_{w_i}$). In the above, the upper index “t” denotes transpose.
We shall use upper-case letters for matrices (e.g., $S(z) = \{S_{w_i w_j}(z)\}_{i,j=1,K}$). In
particular, we write $I$ for the identity matrix, and $\mathbf{1}^t = (1,\ldots,1)$ for the vector
of all ones.

Now we are ready to present exact formulas for the generating functions
$N_r(z) = \sum_{n\ge 0} P(N_n(W) = r)z^n$ and $N(z,u) = \sum_{r\ge 1} N_r(z)u^r$. The following
theorem is a direct consequence of our definitions and language relationships.


THEOREM 7.3.2. Let $W = \{w_1,\ldots,w_K\}$ be a given set of reduced patterns,
each of length $m$, and let $X$ be a random text of length $n$ generated by a memoryless
source. The generating functions $N_r(z)$ and $N(z,u)$ can be computed as follows:

$$N_r(z) = R^t(z)\,M^{r-1}(z)\,U(z), \tag{7.3.2}$$

$$N(z,u) = R^t(z)\,u\,(I - uM(z))^{-1}\,U(z), \tag{7.3.3}$$

where, denoting $w^t = (P(w_1),\ldots,P(w_K))$ and $\mathbf{1}^t = (1,1,\ldots,1)$, we have

$$M(z) = (D(z) + (z-1)I)\,[D(z)]^{-1}, \tag{7.3.4}$$

$$(I - M(z))^{-1} = S(z) + \frac{z^m}{1-z}\,\mathbf{1}\cdot w^t, \tag{7.3.5}$$

$$U(z) = \frac{1}{1-z}\,(I - M(z))\cdot\mathbf{1}, \tag{7.3.6}$$

$$R^t(z) = \frac{z^m}{1-z}\,w^t\,(I - M(z)), \tag{7.3.7}$$

and

$$D(z) = (1-z)S(z) + z^m\,\mathbf{1}\cdot w^t.$$


Using these results and following in the footsteps of our analysis of exact
pattern matching, we arrive at the following asymptotic results.

THEOREM 7.3.3. Let the text $X$ be generated by a memoryless source with
$P(w_i) > 0$ for $i = 1,\ldots,K$ and $P(W) = \sum_{w_i\in W} P(w_i) = w^t\cdot\mathbf{1}$.
(i) The following holds 
E(Nn(W)) = (n— m+ 
Var(Nn(W)) = (n- m+ 


where $S'(1)$ denotes the derivative of the matrix $S(z)$ at $z = 1$.
(ii) Let $\rho_W$ be the smallest root of multiplicity one of $\det D(z) = 0$ outside the
unit circle $|z| \le 1$. There exists $\rho > \rho_W$ such that for $r = O(1)$


Ar =(h—i--Tr 
P(N, (W) = 7) = (1) E (nen 
“ j n —(nt+9 —n 
+ Yi-1a( ) os 1) + O(p*) , 


j=l jot 
where a, are computable constants. 


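(iii) Let $B$ be a bounded real interval and $r = \lfloor E(N_n) + x\sqrt{\mathrm{Var}(N_n)}\rfloor$. Then

$$\sup_{x\in B}\left| P(N_n(W) = r) - \frac{1}{\sqrt{2\pi\,\mathrm{Var}(N_n)}}\,e^{-\frac{1}{2}x^2}\right| \to 0$$

as $n \to \infty$.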


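(iv) Let $r = (1+\delta)E(N_n)$ with $\delta \neq 0$, and let $a = (1+\delta)P(W)$. Define $\tau(t)$ to
be the root of

$$\det(I - e^t M(e^{\tau})) = 0,$$

and $\omega_a$ and $\sigma_a$ to be

$$-\tau'(\omega_a) = a, \qquad -\tau''(\omega_a) = \sigma_a^2.$$

Then

$$P(N_n(W) = r) \sim \frac{1}{\sigma_a\sqrt{2\pi(n-m+1)}}\,e^{-(n-m+1)I(a)+\theta_a},$$

where $I(a) = a\omega_a + \tau(\omega_a)$ and $\theta_a$ is a computable constant (cf. Exercise 7.3.3).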


Proof. We only sketch the derivation of part (iii), but we present two proofs.
Our starting point is

$$N(z,u) = R^t(z)\,u\,(I - uM(z))^{-1}\,U(z),$$

shown in Theorem 7.3.2 to hold for $|z| < 1$ and $|u| < 1$. We may proceed in two
different ways.


METHOD A: DETERMINANT APPROACH


Observe that

$$(I - uM(z))^{-1} = \frac{B(z,u)}{\det(I - uM(z))},$$

where $B(z,u)$ is a complex matrix. Let

$$Q(z,u) := \det(I - uM(z)),$$

and let $z_0 := \rho(u)$ be the smallest root of

$$Q(z,u) = \det(I - uM(z)) = 0.$$

Observe that $\rho(1) = 1$ by (7.3.5).

For our central limit result, we restrict our interest to $\rho(u)$ in a vicinity of
$u = 1$. Such a root exists and is unique since for real $z$ the matrix $M(z)$ has all
positive coefficients. The Perron-Frobenius theorem implies that all other roots
$\rho_i(u)$ are of smaller modulus. Finally, one can analytically continue $\rho(u)$ to a
complex neighborhood of $u$. Thus Cauchy’s formula yields, for some $A < 1$,

$$N_n(u) = [z^n]N(z,u) = \frac{1}{2\pi i}\oint \frac{R^t(z)B(z,u)U(z)}{Q(z,u)}\,\frac{dz}{z^{n+1}}
= C(u)\,\rho^{-n}(u)\,(1 + O(A^n)),$$

where $C(u) = -R^t(\rho(u))B(\rho(u),u)U(\rho(u))\rho^{-1}(u)/Q'(\rho(u),u)$. As in the proof
of Theorem 7.2.11, we recognize a quasi-power form for $N_n(u)$ that directly
leads to the central limit theorem. An application of the saddle point method
completes the proof of the local limit theorem.


METHOD B: EIGENVALUE APPROACH 

We now apply the Perron-Frobenius theorem for positive matrices together
with a matrix spectral representation to obtain even more precise asymptotics.
Our starting point is the following formula:

$$(I - uM(z))^{-1} = \sum_{k\ge 0} u^k M^k(z). \tag{7.3.8}$$

Now, observe that $M(z)$ for real $z$, say $x$, is a positive matrix since each element
$M_{ij}(x)$ is the generating function of the language $\mathcal{M}_{ij}$ and for any $v \in \mathcal{M}_{ij}$ we
have $P(v) > 0$ for memoryless sources. Let $\lambda_1(x), \lambda_2(x),\ldots,\lambda_K(x)$ be the
eigenvalues of $M(x)$. By the Perron-Frobenius result we know that $\lambda_1(x)$ is simple,
real, and $\lambda_1(x) > |\lambda_i(x)|$ for $i \ge 2$. (To simplify our further derivation, we also
assume that the $\lambda_i(x)$ are simple, but this assumption will not have any significant
impact on our asymptotics, as we shall see below.) Let $l_i$ and $r_i$, $i = 1,\ldots,K$, be the
left and right eigenvectors corresponding to the eigenvalues $\lambda_1(x), \lambda_2(x),\ldots,\lambda_K(x)$,
respectively. We set $\langle l_1, r_1\rangle = 1$, where $\langle \mathbf{x},\mathbf{y}\rangle$ is the scalar product of the vectors
$\mathbf{x}$ and $\mathbf{y}$. Since $r_i$ is orthogonal to the left eigenvector $l_j$ for $j \ne i$, we can write
for any vector $\mathbf{x}$

$$\mathbf{x} = \langle l_1,\mathbf{x}\rangle r_1 + \sum_{i=2}^{K} \langle l_i,\mathbf{x}\rangle r_i.$$

This yields

$$M(x)\mathbf{x} = \langle l_1,\mathbf{x}\rangle\lambda_1(x)\,r_1 + \sum_{i=2}^{K} \langle l_i,\mathbf{x}\rangle\lambda_i(x)\,r_i.$$

Since $M^k(x)$ has eigenvalues $\lambda_1^k(x), \lambda_2^k(x),\ldots,\lambda_K^k(x)$, then, dropping even the
assumption about the eigenvalues $\lambda_2,\ldots,\lambda_K$ being simple, we arrive at

$$M^k(x)\mathbf{x} = \langle l_1,\mathbf{x}\rangle r_1\lambda_1^k(x) + \sum_{i=2}^{K} q_i(k)\langle l_i,\mathbf{x}\rangle r_i\lambda_i^k(x), \tag{7.3.9}$$

where $q_i(k)$ is a polynomial in $k$ ($q_i(k) \equiv 1$ when the eigenvalues $\lambda_2,\ldots,\lambda_K$ are
simple). Finally, we observe that we can analytically continue $\lambda_1(x)$ to the complex
plane owing to the separation of $\lambda_1(x)$ from the other eigenvalues, leading to $\lambda_1(z)$.

Applying now (7.3.9) to (7.3.8) and using it in the formula for $N(z,u)$ derived
in Theorem 7.3.2 we obtain

$$\begin{aligned}
N(z,u) &= R^t(z)\,u\,[I - uM(z)]^{-1}U(z) \\
&= uR^t(z)\left(\sum_{k\ge 0} u^k\lambda_1^k(z)\langle l_1(z),U(z)\rangle r_1(z)
 + \sum_{i=2}^{K}\sum_{k\ge 0} u^k\lambda_i^k(z)\langle l_i(z),U(z)\rangle r_i(z)\right) \\
&= \frac{uC_1(z)}{1 - u\lambda_1(z)} + \sum_{i=2}^{K}\frac{uC_i(z)}{1 - u\lambda_i(z)}
\end{aligned}$$

for some polynomials $C_i(z)$. This representation allows us to apply the Cauchy
formula, yielding, as before, for $A < 1$ and a polynomial $B(u)$,

$$N_n(u) = [z^n]N(z,u) = B(u)\,\rho^{-n}(u)\,(1 + O(A^n)),$$

where $\rho(u)$ is the smallest root of $1 - u\lambda_1(z) = 0$, which coincides with the smallest
root of $\det(I - uM(z)) = 0$. In the above, $A < 1$ since $\lambda_1(z)$ dominates all the
other eigenvalues. In the next section we return to this method and discuss it
in some more depth.


7.3.2. Analysis of the generalized string matching 


In this section we deal with a general pattern matching problem where the words
in $W$ are not of the same length, that is, $W = \bigcup_{i=1}^d W_i$ such that $W_i$ is a
subset of $A^{m_i}$ with all $m_i$ being different. We still keep $W_0 = \emptyset$ (i.e., there
are no forbidden words). In the next section, we consider the case $W_0 \ne \emptyset$.
We present here a powerful method based on finite automata (i.e., the de Bruijn
graph). This approach is very versatile, but unfortunately is not as insightful
as the combinatorial approach discussed so far.

Our goal is to derive the probability generating function $N_n(u) = E(u^{N_n(W)})$
of the number of occurrences of the pattern $W$ in the text. We start by building
an automaton that scans the text $X_1X_2\cdots X_n$ and recognizes occurrences of
patterns from the set $W$. As a matter of fact, our automaton is a de Bruijn
graph that we describe in the sequel. Let $M = \max\{m_1,\ldots,m_d\} - 1$ and
$\mathcal{B} = A^M$. The de Bruijn automaton is built over the state space $\mathcal{B}$. Let $b \in \mathcal{B}$
and $a \in A$. Then a transition from a state $b$ upon scanning symbol $a$ of the text
is to $\hat b \in \mathcal{B}$ such that

$$\hat b = b_2b_3\cdots b_M a,$$

that is, the leftmost symbol of $b$ is erased and symbol $a$ is appended on the
right. We shall denote such a transition as $ba \mapsto \hat b$ or $ba \in A\hat b$ since the first
symbol of $b$ has been deleted when scanning symbol $a$. When scanning a text
of length $n - M$ one constructs an associated path of length $n - M$ in the de
Bruijn automaton that begins at a state formed by the first $M$ symbols of the
text, that is, $b = X_1X_2\cdots X_M$.


Figure 7.2. The de Bruijn graph for W = {ab, aab, aba}. 


To record the number of pattern occurrences we equip the automaton with a
counter $\phi(b,a)$. When a transition occurs, we increment $\phi(b,a)$ by the number of
occurrences of patterns from $W$ in the text $ba$. Since all occurrences of patterns
from $W$ that end at $a$ are contained in the text of the form $ba$, we realize that

$$\phi(b,a) = N_{M+1}(W, ba) - N_M(W, b),$$

where $N_k(W,x)$ is the number of pattern occurrences in the text $x$ of length $k$.
Having built such an automaton, we construct a $V^M \times V^M$ transition matrix
$T(u)$ as a function of a complex variable $u$, indexed by $\mathcal{B}\times\mathcal{B}$, such that

$$[T(u)]_{b,\hat b} = P(a)\,u^{\phi(b,a)}\,[\![\,ba \in A\hat b\,]\!]
= P(a)\,u^{N_{M+1}(W,ba) - N_M(W,b)}\,[\![\,\hat b = b_2b_3\cdots b_M a\,]\!], \tag{7.3.10}$$

where Iverson’s bracket convention is used:

$$[\![B]\!] = \begin{cases} 1 & \text{if the property } B \text{ holds},\\ 0 & \text{otherwise}.\end{cases}$$

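EXAMPLE 7.3.4. Let $W = \{ab, aab, aba\}$. Then $M = 2$, the de Bruijn graph
is presented in Figure 7.2, and the matrix $T(u)$, with rows and columns indexed
by the states $aa$, $ab$, $ba$, $bb$ in this order, is shown below:

$$T(u) = \begin{pmatrix}
P(a) & P(b)u & 0 & 0 \\
0 & 0 & P(a)u^2 & P(b) \\
P(a) & P(b) & 0 & 0 \\
0 & 0 & P(a) & P(b)
\end{pmatrix}.$$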


Next, we extend the above construction to scan a text of length $k \ge M$.
By combinatorial properties of matrix products, the entry of index $b,\hat b$ of the
power $T^k(u)$ accumulates all terms corresponding to starting in state $b$, ending
in state $\hat b$, and recording the total number of occurrences of patterns $W$ found
upon scanning the last $k$ letters of the text. Therefore,

$$[T^k(u)]_{b,\hat b} = \sum_{v\in A^k} P(v)\,u^{N_{M+k}(W,bv) - N_M(W,b)}\,[\![\,bv \in A^k\hat b\,]\!]. \tag{7.3.11}$$

Define now a vector $x(u)$ indexed by $b$ as

$$[x(u)]_b = P(b)\,u^{N_M(W,b)}.$$

Then the summation of all the entries of the row vector $x(u)^tT^k(u)$ is achieved
by means of the vector $\mathbf{1} = (1,\ldots,1)$, so that the quantity $x(u)^tT(u)^k\mathbf{1}$ represents
the probability generating function of $N_{k+M}(W)$ taken over all texts of length
$M + k$. By setting $n = M + k$ we prove the following theorem.


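THEOREM 7.3.5. Consider a general pattern $W = (W_1,\ldots,W_d)$ with $M =
\max\{m_1,\ldots,m_d\} - 1$. Let $T(u)$ be the transition matrix defined as

$$[T(u)]_{b,\hat b} = P(a)\,u^{N_{M+1}(W,ba) - N_M(W,b)}\,[\![\,\hat b = b_2b_3\cdots b_M a\,]\!],$$

where $b, \hat b \in A^M$ and $a \in A$. Then

$$N_n(u) = E(u^{N_n(W)}) = b^t(u)\,T^n(u)\,\mathbf{1}, \tag{7.3.12}$$

where $b^t(u) = x^t(u)T^{-M}(u)$. Also,

$$N(z,u) = \sum_{n\ge 0} N_n(u)z^n = b^t(u)\,(I - zT(u))^{-1}\,\mathbf{1} \tag{7.3.13}$$

for $|z| < 1$.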
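
The construction leading to Theorem 7.3.5 can be checked numerically on small instances. The Python sketch below is an illustration under stated assumptions, not a definitive implementation: it builds $T(u)$ from the counter $\phi(b,a) = N_{M+1}(W,ba) - N_M(W,b)$ of (7.3.10) for a numeric value of $u$, and evaluates the probability generating function as $x(u)^t T(u)^{n-M}\mathbf{1}$, which is the form obtained just before the theorem and avoids inverting $T(u)$. The function names (`count_occurrences`, `transition_matrix`, `pgf`) are hypothetical, and NumPy is assumed to be available.

```python
import itertools
import numpy as np

def count_occurrences(text, patterns):
    """Number of (possibly overlapping) occurrences of all patterns in text."""
    return sum(text.startswith(p, i) for p in patterns for i in range(len(text)))

def transition_matrix(patterns, probs, u):
    """De Bruijn transition matrix T(u) over the states A^M, M = max length - 1."""
    M = max(len(p) for p in patterns) - 1
    states = ["".join(s) for s in itertools.product(sorted(probs), repeat=M)]
    index = {b: i for i, b in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for b in states:
        for a, pa in probs.items():
            # phi(b, a): occurrences ending at the newly scanned symbol a
            phi = count_occurrences(b + a, patterns) - count_occurrences(b, patterns)
            T[index[b], index[(b + a)[1:]]] += pa * u ** phi
    return states, T

def pgf(patterns, probs, n, u):
    """E(u^{N_n(W)}) for a memoryless source, for text length n >= M."""
    states, T = transition_matrix(patterns, probs, u)
    M = len(states[0])
    x = np.array([np.prod([probs[c] for c in b]) * u ** count_occurrences(b, patterns)
                  for b in states])
    return x @ np.linalg.matrix_power(T, n - M) @ np.ones(len(states))
```

Since the matrix has $|A|^M$ rows, this direct construction is only practical for small alphabets and short patterns, which is consistent with the remark that the automaton approach is versatile but less insightful than the combinatorial one.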


Let us now return for a moment to the reduced pattern case discussed in
the previous section and compare expression (7.3.13) derived here with (7.3.3)
of Theorem 7.3.2, which we repeat below:

$$N(z,u) = R^t(z)\,u\,(I - uM(z))^{-1}\,U(z).$$

Although there is a striking resemblance between these formulas, they are quite
different. In (7.3.3), $M(z)$ is a matrix function of $z$ representing generating functions of
the languages $\mathcal{M}_{ij}$, while $T(u)$ is a function of $u$ and it is the transition matrix of
the associated de Bruijn graph. Nevertheless, the eigenvalue method discussed
in the proof of Theorem 7.3.3 can be directly applied to derive limit laws of
$N_n(W)$ for a general set of patterns $W$. We shall discuss it next.

To study the asymptotics of $N_n(W)$ we need to estimate the growth of $T^n(u)$,
which is governed by the growth of the largest eigenvalue, as we have already
seen in the previous sections. Here, however, the situation is a little more
complicated since the matrix $T(u)$ is irreducible but not necessarily primitive
(cf. Chapter 1 for an in-depth discussion). To be more precise, $T(u)$ is irreducible
if its associated de Bruijn graph is strongly connected, while for primitivity of
$T(u)$ we require that the greatest common divisor of the cycle weights of the de
Bruijn graph is equal to one.

Let us first verify irreducibility of $T(u)$. As is easy to check, the matrix is
irreducible since for any $g \ge M$ and $b, \hat b \in A^M$ there are two words $w, v \in A^g$
such that $bw = v\hat b$ (e.g., for $g = M$ one can take $w = \hat b$ and $v = b$). Thus
$T^g(u) > 0$ for $u > 0$, which is sufficient for irreducibility.

Let us now have a closer look at the primitivity of $T(u)$. We start with a
precise definition. Let $\psi(b,\hat b) := \phi(b,a)$, where $ba \mapsto \hat b$, be the counter value when
transitioning from $b$ to $\hat b$. Let also $\mathcal{C}$ be a cycle in the associated de Bruijn graph.
Define the total weight of the cycle $\mathcal{C}$ as

$$\psi(\mathcal{C}) = \sum_{b,\hat b\in\mathcal{C}} \psi(b,\hat b).$$

Finally, we set $\psi_W = \gcd(\psi(\mathcal{C}) : \mathcal{C} \text{ cycle})$. If $\psi_W = 1$, then we say $T(u)$ is
primitive.


EXAMPLE 7.3.4 (continued). Consider again the matrix $T(u)$ and its associated
graph shown in Figure 7.2. There are six cycles of respective weights
0, 3, 2, 0, 0, 1; therefore $\psi_W = 1$ and $T(u)$ is primitive.

Consider now another matrix

$$T(u) = \begin{pmatrix} P(a) & P(b)u^4 \\ P(a)u^2 & P(b)u^3 \end{pmatrix}.$$

This time there are three cycles of weights 0, 6 and 3, and $\psi_W = 3$. The matrix
is not primitive. Observe that the characteristic polynomial $\lambda(u)$ of this matrix
is a polynomial in $u^3$.

Observe that the diagonal elements of $T(u)^k$ (i.e., its trace) are polynomials
in $u^\ell$ if and only if $\ell$ divides $\psi_W$; therefore, the characteristic polynomial
$\det(zI - T(u))$ of $T(u)$ is a polynomial in $u^{\psi_W}$. Indeed, it is known that for any matrix $A$

$$\det(I - A) = \exp\left(-\sum_{k>0}\frac{\mathrm{Tr}[A^k]}{k}\right),$$

where $\mathrm{Tr}[A]$ is the trace of $A$.
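
The trace identity is easy to confirm numerically for a matrix whose spectral radius is smaller than one, which is the regime in which the series converges. The following Python sketch is a hedged illustration: the function name `det_via_traces`, the truncation at 200 terms and the test matrix are all assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np

def det_via_traces(A, terms=200):
    """Approximate det(I - A) by exp(-sum_{k>=1} Tr[A^k] / k), truncated.

    The series converges only if the spectral radius of A is below one.
    """
    total = 0.0
    power = np.eye(len(A))
    for k in range(1, terms + 1):
        power = power @ A              # power now holds A^k
        total += np.trace(power) / k
    return np.exp(-total)

A = np.array([[0.2, 0.1],
              [0.3, 0.4]])            # eigenvalues 0.5 and 0.1
approx = det_via_traces(A)
exact = np.linalg.det(np.eye(2) - A)
```

Because $\det(I - A)$ depends on $A$ only through the traces of its powers, any periodicity in $u$ shared by all the traces of $T(u)^k$ is inherited by the characteristic polynomial, which is the point of the observation above.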

The asymptotic behavior of the generating function $N_n(u) = E(u^{N_n(W)})$, and hence of
$N_n(W)$, depends on the growth of $T^n(u)$. The next lemma summarizes some
useful properties of $T(u)$ and its eigenvalues. For the matrix $T(u)$ of dimension
$|A|^M \times |A|^M$ we denote by $\lambda_j(u)$, for $j = 1,\ldots,R = |A|^M$, its eigenvalues and we
assume that $|\lambda_1(u)| \ge |\lambda_2(u)| \ge \cdots \ge |\lambda_R(u)|$. To simplify notation, we often
drop the index of the largest eigenvalue, that is, $\lambda(u) := \lambda_1(u)$. Observe that
$\rho(u) = |\lambda(u)|$ is known as the spectral radius and it is equal to

$$\rho(u) = \lim_{n\to\infty} \|T^n(u)\|^{1/n},$$

where $\|\cdot\|$ is any matrix norm.


LEMMA 7.3.6. Let $G_M(W)$ and $T(u)$ denote, respectively, the de Bruijn graph
and its associated matrix defined in (7.3.10) for a general pattern $W$. Assume
$P(W) > 0$.


(i) For $u > 0$ the matrix $T(u)$ has a unique dominant eigenvalue $\lambda(u)$ ($> \lambda_j(u)$
for $j = 2,\ldots,|A|^M$) that is strictly positive and a dominant eigenvector $r(u)$
all of whose entries are strictly positive. Furthermore, there exists a complex
neighborhood of the real positive axis on which the mappings $u \to \lambda(u)$ and
$u \to r(u)$ are well-defined and analytic.


(ii) Define $\Lambda(s) := \log\lambda(e^s)$ for $s$ complex. For real $s$ the function $s \to \Lambda(s)$ is
strictly increasing and strictly convex. In addition,

$$\Lambda(0) = 0, \qquad \Lambda'(0) = P(W) > 0, \qquad \Lambda''(0) := \sigma^2(W) > 0.$$

(iii) For any $\theta \in (0,2\pi)$ and $x$ real, $\rho(xe^{i\theta}) \le \rho(x)$.


(iv) For any $\theta \in (0,2\pi)$, if $\psi_W = 1$, then for $x$ real $\rho(xe^{i\theta}) < \rho(x)$; otherwise
$\psi_W = d > 1$ and $\rho(xe^{i\theta}) = \rho(x)$ if and only if $\theta = 2k\pi/d$.


Proof. We first prove (i). Take $u$ real positive. Then the matrix $T(u)$
has nonnegative entries, and for any exponent $g \ge M$ the $g$th power of $T(u)$ has
strictly positive entries, as shown above (see irreducibility of $T(u)$). Therefore,
by the Perron-Frobenius theorem (cf. also Chapter 1) there exists an eigenvalue
$\lambda(u)$ that strictly dominates all the others. Moreover, it is simple and strictly
positive. In other words, one has

$$\lambda(u) = \lambda_1(u) > |\lambda_2(u)| \ge |\lambda_3(u)| \ge \cdots.$$

Furthermore, the corresponding eigenvector $r(u)$ has all its components strictly
positive. Since the dominant eigenvalue is separated from the other eigenvalues, by
perturbation theory there exists a complex neighborhood of the real positive
axis where the functions $u \to \lambda(u)$ and $u \to r(u)$ are well-defined and analytic.
Moreover, $\lambda(u)$ is an algebraic function since it satisfies the characteristic
equation $\det(\lambda I - T(u)) = 0$.

We now prove part (ii). The increasing property for $\lambda(u)$ (and thus for $\Lambda(s)$)
is a consequence of the fact that if $A$ and $B$ are nonnegative irreducible matrices
such that $A_{i,j} \ge B_{i,j}$ for all $(i,j)$, then the spectral radius of $A$ is larger than
the spectral radius of $B$.

For convexity of $\Lambda(s)$, it is sufficient to prove that for $u, v > 0$

$$\lambda(\sqrt{uv}) \le \sqrt{\lambda(u)}\,\sqrt{\lambda(v)}.$$

Since eigenvectors are defined up to a constant, one can always choose the
eigenvectors $r(\sqrt{uv})$, $r(u)$, and $r(v)$ such that

$$\max_i \frac{r_i(\sqrt{uv})}{\sqrt{r_i(u)\,r_i(v)}} = 1.$$

Suppose that this maximum is attained at some index $i$. We denote by $P_{ij}$ the
coefficient at $u^{\psi}$ in $T(u)$, that is, $P_{ij} = [u^{\psi}][T(u)]_{ij}$ with $\psi = \psi(i,j)$. By the Cauchy-Schwarz
inequality we have

$$\begin{aligned}
\lambda(\sqrt{uv})\,r_i(\sqrt{uv}) &= \sum_j P_{ij}(\sqrt{uv})^{\psi(i,j)}\,r_j(\sqrt{uv}) \\
&\le \sum_j P_{ij}(\sqrt{uv})^{\psi(i,j)}\sqrt{r_j(u)\,r_j(v)} \\
&\le \Big(\sum_j P_{ij}u^{\psi(i,j)}r_j(u)\Big)^{1/2}\Big(\sum_j P_{ij}v^{\psi(i,j)}r_j(v)\Big)^{1/2} \\
&= \sqrt{\lambda(u)}\,\sqrt{\lambda(v)}\,\sqrt{r_i(u)\,r_i(v)},
\end{aligned}$$

which implies convexity of $\Lambda(s)$. To show that $\Lambda(s)$ is strictly convex, we argue
as follows. Observe that for $u = 1$ the matrix $T(u)$ is stochastic, hence $\lambda(1) = 1$
and $\Lambda(0) = 0$. As we shall see below, the mean and the variance of $N_n(W)$
are asymptotically equal to $n\Lambda'(0)$ and $n\Lambda''(0)$, respectively. From the problem
formulation, we conclude that $\Lambda'(0) = P(W) > 0$ and $\Lambda''(0) = \sigma^2(W) > 0$.
Therefore, $\Lambda'(s)$ and $\Lambda''(s)$ cannot be always 0 and (since they are analytic)
they cannot be zero on any interval. This implies that $\Lambda(s)$ is strictly increasing
and strictly convex.


We now establish part (iii). For $|u| = 1$ and $x$ real positive, consider the two
matrices $T(x)$ and $T(xu)$. From (i) we know that for $T(x)$ there exist a dominant
strictly positive eigenvalue $\lambda := \lambda(x)$ and a dominant eigenvector $r := r(x)$
all of whose entries $r_j$ are strictly positive. Consider an eigenvalue $\nu$ of $T(xu)$ and
its corresponding eigenvector $s := s(u)$. Denote by $v_j$ the ratio $s_j/r_j$. One can
always choose $r$ and $s$ such that $\max_{1\le j\le R}|v_j| = 1$. Suppose that this maximum
is attained for some index $i$. Then

$$|\nu s_i| = \Big|\sum_j P_{ij}(xu)^{\psi(i,j)}s_j\Big| \le \sum_j P_{ij}x^{\psi(i,j)}r_j = \lambda r_i. \tag{7.3.14}$$

We conclude that $|\nu| \le \lambda$, and part (iii) is proven.


Finally we deal with part (iv). Suppose now that the equality $|\nu| = \lambda$ holds.
Then all the previous inequalities in (7.3.14) become equalities. First, for all
indices $\ell$ such that $P_{i\ell} \ne 0$, we deduce that $|s_\ell| = r_\ell$, and $v_\ell$ has modulus 1.
For these indices $\ell$, we have the same equalities in (7.3.14) as for $i$. Finally,
the transitivity of the de Bruijn graph entails that each complex $v_j$ is of
modulus 1. Now, the converse of the triangle inequality shows that for every
edge $(i,j) \in G_M(W)$ we have

$$u^{\psi(i,j)}\,v_j = \frac{\nu}{\lambda}\,v_i,$$

and for any cycle $\mathcal{C}$ of length $L$ we conclude that

$$\left(\frac{\nu}{\lambda}\right)^{L} = u^{\psi(\mathcal{C})}.$$

However, for any pattern $W$ there exists a cycle $\mathcal{C}$ of length one with weight
$\psi(\mathcal{C}) = 0$, as is easy to see. This proves that $\nu = \lambda$ and that $u^{\psi(\mathcal{C})} = 1$ for any
cycle $\mathcal{C}$. If $\psi_W = \gcd(\psi(\mathcal{C}), \mathcal{C} \text{ cycle}) = 1$, then $u = 1$ and $\rho(xe^{i\theta}) < \rho(x)$ for
$\theta \in (0,2\pi)$.

Suppose now that $\psi_W = d > 1$. Then the characteristic polynomial and the
dominant eigenvalue $\lambda(v)$ are functions of $v^d$. The lemma is proved.


Lemma 7.3.6 provides the main technical support to prove the forthcoming
results, in particular to establish the asymptotic behavior of $T^n(u)$ for large $n$.
Indeed, our starting point is (7.3.13), to which we apply the spectral decomposition
as in (7.3.9) to conclude that

$$N(z,u) = \frac{c(u)}{1 - z\lambda(u)} + \sum_{i\ge 2}\frac{c_i(u)}{(1 - z\lambda_i(u))^{\alpha_i}},$$

where $\alpha_i \ge 1$ are some integers. In the above, $\lambda(u)$ is the dominant eigenvalue,
while $|\lambda_i(u)| < \lambda(u)$ are the other eigenvalues. The numerator has the expression
$c(u) = b^t(u)\langle l(u),\mathbf{1}\rangle r(u)$, where $l(u)$ and $r(u)$ are the left and the right dominant
eigenvectors and $b^t(u)$ is defined after (7.3.12). Then Cauchy’s coefficient
formula implies

$$N_n(u) = c(u)\,\lambda^n(u)\,(1 + O(A^n)) \tag{7.3.15}$$

for some $A < 1$. Equivalently, the moment generating function of $N_n(W)$ is
given by the following uniform approximation in a neighborhood of $s = 0$:

$$E(e^{sN_n(W)}) = d(s)\,\lambda^n(e^s)\,(1 + O(A^n)) = d(s)\exp(n\Lambda(s))\,(1 + O(A^n)), \tag{7.3.16}$$

where $d(s) = c(e^s)$ and $\Lambda(s) = \log\lambda(e^s)$.

There is another, more general, derivation of (7.3.15). Observe that the
spectral decomposition of $T(u)$ when $u$ lies in a sufficiently small complex
neighborhood of any compact subinterval of $(0,+\infty)$ is of the form

$$T(u) = \lambda(u)Q(u) + R(u), \tag{7.3.17}$$

where $Q(u)$ is the projection onto the dominant eigensubspace and $R(u)$ is a
matrix whose spectral radius equals $|\lambda_2(u)|$. Therefore,

$$T(u)^n = \lambda(u)^n Q(u) + R(u)^n$$

entails the estimate (7.3.15). The next result follows immediately from (7.3.16).


THEOREM 7.3.7. Let $W = (W_0, W_1,\ldots,W_d)$ be a generalized pattern with
$W_0 = \emptyset$, and let the text be generated by a memoryless source. For large $n$,

$$E(N_n(W)) = n\Lambda'(0) + O(1) = nP(W) + O(1), \tag{7.3.18}$$

$$\mathrm{Var}(N_n(W)) = n\Lambda''(0) + O(1) = n\sigma^2(W) + O(1), \tag{7.3.19}$$

where $\Lambda(s) = \log\lambda(e^s)$ and $\lambda(u)$ is the largest eigenvalue of $T(u)$. Furthermore,

$$P(N_n(W) = 0) = C\lambda^n(0)\,(1 + O(A^n)),$$

where $C > 0$ is a constant and $A < 1$.
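
The leading constants in (7.3.18) and (7.3.19) can be estimated numerically from the dominant eigenvalue alone. The Python sketch below is an illustration under stated assumptions: it rebuilds the de Bruijn matrix of (7.3.10) for a numeric $u$, takes the spectral radius as $\lambda(u)$ for $u > 0$, and approximates $\Lambda'(0)$ and $\Lambda''(0)$ by central finite differences with a step size chosen ad hoc. The function names (`dominant_eigenvalue`, `cumulant_rates`) are hypothetical and NumPy is assumed to be available.

```python
import itertools
import numpy as np

def dominant_eigenvalue(patterns, probs, u):
    """Spectral radius of the de Bruijn matrix T(u) for real u > 0."""
    M = max(map(len, patterns)) - 1
    states = ["".join(t) for t in itertools.product(sorted(probs), repeat=M)]
    idx = {b: i for i, b in enumerate(states)}

    def occ(text):
        return sum(text.startswith(p, i) for p in patterns for i in range(len(text)))

    T = np.zeros((len(states), len(states)))
    for b in states:
        for a, pa in probs.items():
            T[idx[b], idx[(b + a)[1:]]] += pa * u ** (occ(b + a) - occ(b))
    return max(abs(np.linalg.eigvals(T)))

def cumulant_rates(patterns, probs, h=1e-4):
    """Finite-difference estimates of Lambda'(0) and Lambda''(0)."""
    L = lambda s: np.log(dominant_eigenvalue(patterns, probs, np.exp(s)))
    mean_rate = (L(h) - L(-h)) / (2 * h)
    var_rate = (L(h) - 2 * L(0.0) + L(-h)) / h ** 2
    return mean_rate, var_rate

# For W = {ab, aab, aba} over a uniform binary alphabet, P(W) = 1/4 + 1/8 + 1/8.
mean_rate, var_rate = cumulant_rates(["ab", "aab", "aba"], {"a": 0.5, "b": 0.5})
```

Here the first estimate should agree with $P(W)$, the sum of the pattern probabilities, which gives a quick consistency check of the matrix construction; the second gives the linear growth rate of the variance without any combinatorial computation.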


Version June 23, 2004 




Now we establish limit laws, starting with the central limit law and its local 
limit law. 


THEOREM 7.3.8. Under the same assumptions as for Theorem 7.3.7, the following
holds:

$$\sup_{x\in B}\left| P\left(\frac{N_n(W) - nP(W)}{\sigma(W)\sqrt{n}} \le x\right) - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt\right| = O\left(\frac{1}{\sqrt{n}}\right), \tag{7.3.20}$$

where $B$ is a bounded real interval.


Proof. The uniform asymptotic expansion (7.3.16) of a sequence of moment
generating functions is known as a “quasi-powers approximation”. An application
of the classical Lévy continuity theorem then leads to the Gaussian limit
law. An application of the Berry-Esseen inequality provides the speed of convergence,
which is $O(1/\sqrt{n})$. This proves the theorem.


Finally, we deal with the large deviations. 


THEOREM 7.3.9. Under the same assumptions as before, let $\omega_a$ be a solution
of

$$\omega\lambda'(\omega) = a\lambda(\omega)$$

for some $a \ne P(W)$, where $\lambda(u)$ is the largest eigenvalue of $T(u)$. Define

$$I(a) = a\log\omega_a - \log\lambda(\omega_a). \tag{7.3.21}$$

Then there exists a constant $C > 0$ such that $I(a) > 0$ for $a \ne P(W)$ and

$$\lim_{n\to\infty}\frac{1}{n}\log P(N_n(W) \le an) = -I(a) \quad\text{if } 0 < a < P(W), \tag{7.3.22}$$

$$\lim_{n\to\infty}\frac{1}{n}\log P(N_n(W) \ge na) = -I(a) \quad\text{if } P(W) < a < C. \tag{7.3.23}$$
Proof. We consider now large deviations and establish (7.3.22). The variable 
N,,(W) is by definition of at most linear growth, and there exists a a constant 
C such that N,(W) < Cn+O(1). Let 0 < « < P(W). Cauchy’s coefficient 
formula provides 


1 N,(u) du 

P(N,(W) < k) = — ——_ ——_.. 

(Nn(W) $k) 2im Jiuiar UF ou(1 — wu) 

For ease of exposition, we first discuss the case of a primitive pattern. We 
recall that a pattern is primitive if wy = gcd(w(C), C cycle) = 1. The strict 
domination property expressed in Lemma 7.3.6(iv) for primitive patterns implies 
that the above integrand is strictly maximal at the intersection of the circle |u| = 
r and the positive real axis. Near the positive real axis, where the contribution 
of the integrand is concentrated, the following uniform approximation holds, 
with k = na: 


$$\frac{N_n(u)}{u^k} = \exp\big(n(\log\lambda(u) - a\log u)\big)\,(1 + o(1)). \tag{7.3.24}$$

The saddle point equation is then obtained by cancelling the first derivative,
yielding

$$F(\omega) := \frac{\omega\lambda'(\omega)}{\lambda(\omega)} = a. \tag{7.3.25}$$

Note that the function $F$ is exactly the derivative of $\Lambda(s)$ at the point $s := \log\omega$.
Since $\Lambda(s)$ is strictly convex, the left side is an increasing function of its argument,
as proved in Lemma 7.3.6(ii). Also, we know from this lemma that
$F(0) = 0$ and $F(1) = P(W)$, while we set $F(\infty) = C$. Thus, for any real $a$ in $(0, C)$,
equation (7.3.25) always admits a unique positive solution that we denote by
$\omega \equiv \omega_a$. Moreover, for $a \neq P(W)$, one has $\omega_a \neq 1$. Since the function


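$$u \mapsto -\log\frac{\lambda(u)}{u^a}$$

vanishes at $u = 1$ and admits a strict maximum at $u = \omega_a \neq 1$, this maximum
$I(a)$ is strictly positive. Finally, the usual saddle point approximation applies
and one finds

$$P(N_n(W) \le an) = \left(\frac{\lambda(\omega_a)}{\omega_a^{a}}\right)^{n}\Theta(n),$$

where $\Theta(n)$ is of the order of $n^{-1/2}$. In summary, the large deviation rate is

$$I(a) = -\log\frac{\lambda(\omega_a)}{\omega_a^{a}} \quad\text{with}\quad \frac{\omega_a\lambda'(\omega_a)}{\lambda(\omega_a)} = a,$$

as shown in the theorem.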

In the general case when the pattern is not primitive, the strict inequality 
of Lemma 7.3.6(iv) is not satisfied, and several saddle points may be present 
on the circle |u| = r, which will lead to some oscillations. We must, in this 
case, use the weaker inequality of Lemma 7.3.6, namely, o(xe”) < o(x), which 
replaces the strict inequality. However, the factor (1 — u)~! present in the 
integrand of (7.3.24) attains its maximum modulus on |u| = r solely at u =r. 
Thus, the contribution of possible saddle points can only affect a fraction of the 
contribution from u = r. Consequently, (7.3.22) and (7.3.21) continue to be 
valid. A similar reasoning provides the right tail estimate, with I(a) still given 
by (7.3.21). This completes the proof of (7.3.22). □


We complete this analysis with a local limit law. 


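THEOREM 7.3.10. If $T(u)$ is primitive, then

$$\sup_{x\in B}\left|\,P\big(N_n = nP(W) + x\sigma(W)\sqrt{n}\big) - \frac{1}{\sigma(W)\sqrt{n}}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\right| = o\!\left(\frac{1}{\sqrt{n}}\right), \tag{7.3.26}$$

where $B$ is a bounded real interval. Furthermore, under the above additional
assumption, one can find constants $\sigma_a$ and $\delta_a$ such that

$$P(N_n(W) = aE(N_n)) \sim \frac{1}{\sigma_a\sqrt{2\pi n}}\,e^{-nI(a)+\delta_a}, \tag{7.3.27}$$
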
where $I(a)$ is defined in (7.3.21) above.

Proof. Stronger “regularity conditions” are needed in order to obtain local limit
estimates. Roughly, one wants to exclude the possibility that the discrete
distribution is of a lattice type, being supported by a nontrivial sublattice of the
integers. (For instance, we need to exclude the possibility for $N_n(W)$ to be
always odd, or of the parity of $n$, and so on.) Observe first that positivity and
irreducibility of the matrix $T(u)$ are not enough, as shown in Example 7.3.4.

By Lemma 7.3.6, one can estimate the probability distribution of $N_n(W)$ by
the classical saddle point method in the case when $W$ is primitive. Again, one
starts from Cauchy’s coefficient integral,

$$P(N_n(W) = k) = \frac{1}{2i\pi}\oint_{|u|=1}\frac{N_n(u)}{u^{k+1}}\,du, \tag{7.3.28}$$

where $k$ is of the form $k = nP(W) + x\sigma(W)\sqrt{n}$. Property (iv) of Lemma 7.3.6
grants us precisely the fact that any closed arc of the unit circle not containing
$u = 1$ brings an exponentially negligible contribution. A standard application of
the saddle point technique does the job. In this way, the proof of the local limit
law of Theorem 7.3.10 is completed. Finally, the precise large deviations follow
from the local limit result and an application of the method of shift discussed
in the proof of Theorem 7.2.12. □


7.3.3. Forbidden words and $(\ell,k)$ sequences


Finally, consider the general pattern $W = (W_0, W_1, \ldots, W_d)$ with nonempty
forbidden set $W_0$. In this case, we study the number of occurrences
$N_n(W \mid W_0 = 0)$ of patterns $W_1, \ldots, W_d$ under the condition that there is no
occurrence in the text of any pattern from $W_0$.

Fortunately, we can recover almost all results from our previous analysis after
redefining the matrix $T(u)$ and its de Bruijn graph. We now change (7.3.10) to


[T(u)], 5 = P(aju®™ [ba € Ab and ba ¢ Wo] (7.3.29) 


where $ba \subset W_0$ means that any subword of $ba$ belongs to $W_0$. In words, we
force the matrix $T(u)$ to be zero at any position that leads to a word containing
patterns from $W_0$, that is, we eliminate from the de Bruijn graph any transition
that contains a forbidden word. Having constructed the matrix $T(u)$, we can repeat
all previous results, except that it is much harder to find explicit formulas even
for the mean and the variance (cf. Exercise 7.3.4).

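Finally, we consider a degenerate general pattern in which $W_i = \emptyset$ for
all $i = 1, \ldots, d$, so that only $W_0$ is nonempty. In this case, we count the number of
sequences that do not contain a pattern from $W_0$. We only consider a special
case of this problem, that of $(\ell,k)$ sequences, for which $W_0$ is defined in (7.3.1).
In particular, we compute the so-called Shannon capacity $C_{\ell,k}$, defined as

$$C_{\ell,k} = \lim_{n\to\infty}\frac{\log(\text{number of } (\ell,k) \text{ sequences of length } n)}{n}.$$

We first compute the ordinary generating function $T_{\ell,k}(z) = \sum_{w\in\mathcal{T}_{\ell,k}} z^{|w|}$ of
all $(\ell,k)$ words, whose set is denoted by $\mathcal{T}_{\ell,k}$. To enumerate $\mathcal{T}_{\ell,k}$, we define $\mathcal{D}_{\ell,k}$ as the set of
all words consisting only of runs of 0’s whose length is between $\ell$ and $k$. The
generating function $D(z)$ is clearly equal to

$$D(z) = z^{\ell} + z^{\ell+1} + \cdots + z^{k} = z^{\ell}\,\frac{1 - z^{k-\ell+1}}{1 - z}.$$

We now observe that $\mathcal{T}_{\ell,k}$ can be symbolically written as

$$\mathcal{T}_{\ell,k} = \mathcal{D}_{\ell,k} \times \left(\{\varepsilon\} + \bar{\mathcal{D}}_{\ell,k} + \bar{\mathcal{D}}_{\ell,k}\times\bar{\mathcal{D}}_{\ell,k} + \cdots + \bar{\mathcal{D}}_{\ell,k}^{\,j} + \cdots\right), \tag{7.3.30}$$

where $\bar{\mathcal{D}}_{\ell,k} = \{1\} \times \mathcal{D}_{\ell,k}$. The above basically says that the collection of $(\ell,k)$
sequences, $\mathcal{T}_{\ell,k}$, is a concatenation of $\{1\} \times \mathcal{D}_{\ell,k}$. Thus (7.3.30) translates into
the generating function $T_{\ell,k}(z)$ as follows:

$$T_{\ell,k}(z) = D(z)\,\frac{1}{1 - zD(z)} = \frac{z^{\ell}\,(1 - z^{k+1-\ell})}{1 - z - z^{\ell+1} + z^{k+2}} = \frac{z^{\ell} + z^{\ell+1} + \cdots + z^{k}}{1 - z^{\ell+1} - z^{\ell+2} - \cdots - z^{k+1}}. \tag{7.3.31}$$

Then the Shannon capacity $C_{\ell,k}$ is

$$C_{\ell,k} = \lim_{n\to\infty}\frac{\log\,[z^n]T_{\ell,k}(z)}{n}.$$

If $\rho$ is the smallest root in absolute value of $1 - z^{\ell+1} - z^{\ell+2} - \cdots - z^{k+1}$,
then clearly

$$C_{\ell,k} = -\log\rho.$$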

EXAMPLE 7.3.11. In this example, we show that one can enumerate $(\ell,k)$
sequences more precisely. In fact, since the function $T_{\ell,k}(z)$ is rational, we can
compute $[z^n]T_{\ell,k}(z)$ exactly. Let us consider a particular case, namely, $\ell = 1$
and $k = 3$. Then the denominator in (7.3.31) becomes $1 - z^2 - z^3 - z^4$, and its
roots are

$$\rho_{-1} = -1, \quad \rho_0 = 0.682327\ldots, \quad \rho_1 = -0.341164\ldots + i\,1.161541\ldots, \quad \rho_2 = \bar\rho_1.$$

Computing residues we obtain

$$[z^n]T_{1,3}(z) = \frac{\rho_0 + \rho_0^2 + \rho_0^3}{(\rho_0 + 1)(\rho_0 - \rho_1)(\rho_0 - \bar\rho_1)}\,\rho_0^{-n-1} + (-1)^{n+1}\frac{1}{(\rho_0 + 1)(\rho_1 + 1)(\bar\rho_1 + 1)} + O(r^n),$$

where $r = 1/|\rho_1| \approx 0.83$. More specifically, since
$(\rho_0 + 1)(\rho_1 + 1)(\bar\rho_1 + 1) = 3$,

$$[z^n]T_{1,3}(z) = 0.363\,(1.4656)^{n+1} + 0.333\,(-1)^{n+1} + O(0.83^n)$$

for large $n$.
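
As a cross-check of this kind of expansion, the coefficients of the rational function in (7.3.31) can be generated from the linear recurrence implied by its denominator and compared with the contribution of the dominant real pole. This is only an illustrative sketch: the helper names are not from the text, and the dominant pole is located by bisection under the assumption $k > \ell$.

```python
def coefficients(ell, k, n_max):
    """Taylor coefficients t_0..t_{n_max} of
    (z^ell + ... + z^k) / (1 - z^(ell+1) - ... - z^(k+1)).

    Multiplying out gives t_n = [ell <= n <= k] + sum_{j=ell+1}^{k+1} t_{n-j}.
    """
    t = [0] * (n_max + 1)
    for n in range(n_max + 1):
        t[n] = 1 if ell <= n <= k else 0
        for j in range(ell + 1, k + 2):
            if n - j >= 0:
                t[n] += t[n - j]
    return t


def dominant_term(ell, k, n):
    """Contribution c * rho^(-n-1) of the positive real pole rho (assumes k > ell)."""
    def q(z):
        return 1.0 - sum(z ** j for j in range(ell + 1, k + 2))

    lo, hi = 0.0, 1.0
    for _ in range(200):  # bisection: q(0) > 0 > q(1)
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if q(mid) > 0 else (lo, mid)
    rho = (lo + hi) / 2
    numerator = sum(rho ** i for i in range(ell, k + 1))
    minus_q_prime = sum(j * rho ** (j - 1) for j in range(ell + 1, k + 2))
    return (numerator / minus_q_prime) * rho ** (-(n + 1))


if __name__ == "__main__":
    t = coefficients(1, 3, 12)
    for n in (4, 8, 12):
        print(n, t[n], round(dominant_term(1, 3, n), 3))
```

The exact coefficients and the dominant term differ by a bounded oscillating quantity, which comes from the pole at $-1$ and from the complex pair of poles.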
7.4. Subsequence pattern matching


In the string matching problem, given a pattern $W$ one searches for some or all
occurrences of $W$ as a block of consecutive symbols in a text. We analyzed
various string matching problems in the previous sections. Here we concentrate
on subsequence pattern matching. In this case we search for a given pattern
$W = w_1 w_2 \cdots w_m$ in the text $X = x_1 x_2 \cdots x_n$ as a subsequence, that is, we look
for indices $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $x_{i_1} = w_1$, $x_{i_2} = w_2$, ...,
$x_{i_m} = w_m$. We also say that the word $W$ is “hidden” in the text; thus we call
this the hidden pattern problem. For example, date occurs as a subsequence in
the text hidden pattern, in fact four times, but not even once as a string.
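
The count in this example can be checked mechanically. The sketch below is illustrative only (the function name is not from the text); it counts unconstrained subsequence occurrences with the standard dynamic programme in which `ways[j]` holds the number of ways to match the first `j` pattern symbols inside the text read so far.

```python
def count_hidden_occurrences(text, pattern):
    """Number of index tuples i_1 < ... < i_m with text[i_j] == pattern[j]."""
    m = len(pattern)
    ways = [1] + [0] * m  # ways[0] = 1: the empty prefix matches once
    for symbol in text:
        # Traverse j downwards so each text symbol is used at most once per tuple.
        for j in range(m, 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[m]


if __name__ == "__main__":
    print(count_hidden_occurrences("hidden pattern", "date"))
    print("date" in "hidden pattern")
```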

More specifically, we allow the possibility of imposing an additional set of
constraints $\mathcal{D}$ on the indices $i_1, i_2, \ldots, i_m$ to record a valid subsequence
occurrence. For a given family of integers $d_j$ ($d_j \ge 1$, possibly $d_j = \infty$), one should
have $(i_{j+1} - i_j) \le d_j$. More formally, the hidden pattern specification is
determined by a pair $(W, \mathcal{D})$ where $W = w_1 \cdots w_m$ is a word of length $m$ and the
constraint $\mathcal{D} = (d_1, \ldots, d_{m-1})$ is an element of $(\mathbb{N}^+ \cup \{\infty\})^{m-1}$.


EXAMPLE 7.4.1. With # representing a ‘don’t-care-symbol’ and the subscript
denoting a strict upper bound on the length of the associated gap, a typical
pattern may look like

$$\mathtt{ab}\#_2\mathtt{r}\#\mathtt{ac}\#\mathtt{a}\#\mathtt{d}\#_4\mathtt{a}\#\mathtt{br}\#\mathtt{a} \tag{7.4.1}$$

where $\# = \#_\infty$ and $\#_1$ is omitted; that is, ‘ab’ should occur first contiguously,
followed by ‘r’ with a gap of $< 2$ symbols, followed anywhere later in the text
by ‘ac’, etc.


The case when all the $d_j$’s are infinite is called the (fully) unconstrained
problem. When all the $d_j$’s are finite, then we speak of the (fully) constrained
problem. In particular, the case where all $d_j$ are equal to one reduces to the
exact string matching problem. Furthermore, observe that when all $d_j < \infty$
(fully constrained pattern), the problem can be treated as the generalized string
matching discussed in Section 7.3. In this case, the general pattern $W$ is a set
consisting of all words satisfying the constraint $\mathcal{D}$. However, if at least one $d_j$
is infinite, then the techniques discussed so far are not well suited to handle
it. Therefore, in this section, we develop new methods that make the analysis
possible.

If an $m$-tuple $I = (i_1, i_2, \ldots, i_m)$ ($1 \le i_1 < i_2 < \cdots < i_m$) satisfies the
constraint $\mathcal{D}$ with $i_{j+1} - i_j \le d_j$, then it is called a position tuple. Let $\mathcal{P}_n(\mathcal{D})$ be the
set of all positions subject to the separation constraint $\mathcal{D}$, satisfying furthermore
$i_m \le n$. Let also $\mathcal{P}(\mathcal{D}) = \bigcup_n \mathcal{P}_n(\mathcal{D})$. An occurrence of pattern $W$ subject to
the constraint $\mathcal{D}$ is a pair $(I, X)$ formed with a position $I = (i_1, i_2, \ldots, i_m)$ of
$\mathcal{P}_n(\mathcal{D})$ and a text $X = x_1 x_2 \cdots x_n$ for which $x_{i_1} = w_1, x_{i_2} = w_2, \ldots, x_{i_m} = w_m$.
Thus, what we call an occurrence is a text augmented with the distinguished
positions at which the pattern occurs. The number $\Omega$ of occurrences of pattern
$W$ in text $X$ as a subsequence subject to the constraint $\mathcal{D}$ is then a sum of
characteristic variables

$$\Omega(X) = \sum_{I\in\mathcal{P}_{|X|}(\mathcal{D})} Z_I(X), \tag{7.4.2}$$

where $Z_I(X) := [\![W \text{ occurs at position } I \text{ in } X]\!]$. When the text $X$ is of length
$n$, then we often write $\Omega_n := \Omega(X)$.

In order to proceed we need to introduce the important notions of blocks and
aggregates. In the general case, we assume that the subset $F$ of indices $j$ for
which $d_j$ is finite ($d_j < \infty$) has cardinality $m - b$ with $1 \le b \le m$. The two
extreme values of $b$, namely, $b = m$ and $b = 1$, describe the (fully) unconstrained
and the (fully) constrained problem, respectively. Thus, the subset $U$ of indices
$j$ for which $d_j$ is unbounded ($d_j = \infty$) has cardinality $b - 1$. It then separates
the pattern $W$ into $b$ independent subpatterns that are called the blocks and
are denoted by $W_1, W_2, \ldots, W_b$. All the possible $d_j$ “inside” any $W_r$ are finite
and form the subconstraint $\mathcal{D}_r$, so that a general hidden pattern specification
$(W, \mathcal{D})$ is equivalently described as a $b$-tuple of fully constrained hidden patterns
$((W_1, \mathcal{D}_1), (W_2, \mathcal{D}_2), \ldots, (W_b, \mathcal{D}_b))$.


EXAMPLE 7.4.1 (continued). Consider again

$$\mathtt{ab}\#_2\mathtt{r}\#\mathtt{ac}\#\mathtt{a}\#\mathtt{d}\#_4\mathtt{a}\#\mathtt{br}\#\mathtt{a},$$

in which one has $b = 6$, the six blocks being

$$W_1 = \mathtt{a}\#_1\mathtt{b}\#_2\mathtt{r}, \quad W_2 = \mathtt{a}\#_1\mathtt{c}, \quad W_3 = \mathtt{a}, \quad W_4 = \mathtt{d}\#_4\mathtt{a}, \quad W_5 = \mathtt{b}\#_1\mathtt{r}, \quad W_6 = \mathtt{a}.$$


In the same way, an occurrence position $I = (i_1, i_2, \ldots, i_m)$ of $W$ subject to
constraint $\mathcal{D}$ gives rise to $b$ suboccurrences, $I^{[1]}, I^{[2]}, \ldots, I^{[b]}$, the $r$th term $I^{[r]}$
representing an occurrence of $W_r$ subject to constraint $\mathcal{D}_r$. The $r$th block $B^{[r]}$
is the closed segment whose end points are the extremal elements of $I^{[r]}$, and
the aggregate of position $I$, denoted by $\alpha(I)$, is the collection of these $b$ blocks.


EXAMPLE 7.4.1 (continued). Taking the pattern of Example 7.4.1, the position
tuple

$$I = (6, 7, 9, 18, 19, 22, 30, 33, 50, 51, 60)$$

satisfies the constraint $\mathcal{D}$ and gives rise to six subpositions,

$$I^{[1]} = (6,7,9), \quad I^{[2]} = (18,19), \quad I^{[3]} = (22), \quad I^{[4]} = (30,33), \quad I^{[5]} = (50,51), \quad I^{[6]} = (60);$$

accordingly, the resulting aggregate $\alpha(I)$,

$$B^{[1]} = [6,9], \quad B^{[2]} = [18,19], \quad B^{[3]} = [22], \quad B^{[4]} = [30,33], \quad B^{[5]} = [50,51], \quad B^{[6]} = [60],$$

is formed with six blocks.
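
The decomposition of a position tuple into blocks can be sketched in a few lines. The encoding below is an assumption made for illustration: the constraint is a list `gaps` with one entry per consecutive pair of pattern symbols, where an integer is a finite $d_j$ and `None` stands for $d_j = \infty$; the function name is hypothetical.

```python
def aggregate(positions, gaps):
    """Split a position tuple into blocks, cutting at every unbounded gap.

    positions: increasing tuple (i_1, ..., i_m)
    gaps:      list of length m - 1; gaps[j] is d_{j+1}, or None for infinity
    Returns the aggregate as a list of closed segments (lo, hi).
    """
    if len(gaps) != len(positions) - 1:
        raise ValueError("need exactly one gap bound per consecutive pair")
    blocks = []
    start = positions[0]
    for j, d in enumerate(gaps):
        if d is None:
            # Unbounded gap: the current block ends here, a new one starts.
            blocks.append((start, positions[j]))
            start = positions[j + 1]
        elif positions[j + 1] - positions[j] > d:
            raise ValueError("position tuple violates the constraint")
    blocks.append((start, positions[-1]))
    return blocks


if __name__ == "__main__":
    # Pattern of Example 7.4.1: ab #2 r # ac # a # d #4 a # br # a
    gaps = [1, 2, None, 1, None, None, 4, None, 1, None]
    I = (6, 7, 9, 18, 19, 22, 30, 33, 50, 51, 60)
    print(aggregate(I, gaps))
```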
7.4.1. Mean and variance analysis


Hereafter, we assume that $W$ is given and the text $X$ is generated by a
(nondegenerate) memoryless source. The first moment of the number of occurrences,
$\Omega(X)$, is easily obtained by describing the collection of all occurrences in terms
of formal languages, as already discussed in previous sections. We consider the
collection of position-text pairs

$$\mathcal{O} := \{(I, X)\ ;\ I \in \mathcal{P}_{|X|}(\mathcal{D})\},$$

with the size of an element being by definition the length $n$ of the text $X$. The
weight of an element of $\mathcal{O}$ is taken to be equal to $Z_I(X)P(X)$, where $P(X)$ is
the probability of the text. In this way, $\mathcal{O}$ can also be regarded as the collection
of all occurrences weighted by probabilities of the text. The corresponding
generating function of $\mathcal{O}$ equipped with this weight is

$$O(z) = \sum_{(I,X)\in\mathcal{O}} Z_I(X)P(X)\,z^{|X|} = \sum_{X}\Bigg(\sum_{I\in\mathcal{P}_{|X|}(\mathcal{D})} Z_I(X)\Bigg)P(X)\,z^{|X|}, \tag{7.4.3}$$

and, with the definition of $\Omega$,

$$O(z) = \sum_{X}\Omega(X)P(X)\,z^{|X|} = \sum_{n} E(\Omega_n)\,z^n. \tag{7.4.4}$$

As a consequence, one has $[z^n]O(z) = E(\Omega_n)$, so that $O(z)$ serves as the
generating function of the sequence of expectations $E(\Omega_n)$.

On the other hand, each occurrence can be viewed as a “context” with an
initial string, then the first letter of the pattern, then a separating string, then
the second letter, etc. The collection $\mathcal{O}$ is therefore described combinatorially
by

$$\mathcal{O} = \mathcal{A}^* \times \{w_1\} \times \mathcal{A}^{<d_1} \times \{w_2\} \times \mathcal{A}^{<d_2} \times \cdots \times \{w_{m-1}\} \times \mathcal{A}^{<d_{m-1}} \times \{w_m\} \times \mathcal{A}^*. \tag{7.4.5}$$

There, for $d < \infty$, $\mathcal{A}^{<d}$ denotes the collection of all words of length strictly less
than $d$, i.e., $\mathcal{A}^{<d} := \bigcup_{i<d}\mathcal{A}^i$, whereas, for $d = \infty$, $\mathcal{A}^{<\infty}$ denotes the collection of
all finite words, i.e., $\mathcal{A}^{<\infty} := \mathcal{A}^* = \bigcup_{i<\infty}\mathcal{A}^i$. Since the source is memoryless,
the rules discussed at the end of the last section can be applied, and they give
access to $O(z)$ from the description (7.4.5). The generating functions
associated to $\mathcal{A}^{<d}$ and $\mathcal{A}^{<\infty}$ are

$$A^{<d}(z) = 1 + z + z^2 + \cdots + z^{d-1} = \frac{1 - z^d}{1 - z}, \qquad A^{<\infty}(z) = 1 + z + z^2 + \cdots = \frac{1}{1 - z}.$$

Thus, the description (7.4.5) of occurrences automatically translates into

$$O(z) = \sum_{n\ge 0} E(\Omega_n)\,z^n = \left(\frac{1}{1-z}\right)^{b+1} \times \left(\prod_{i=1}^{m} P(w_i)\,z\right) \times \left(\prod_{i\in F}\frac{1 - z^{d_i}}{1 - z}\right). \tag{7.4.6}$$

One finally finds

$$E(\Omega_n) = [z^n]O(z) = \frac{n^b}{b!}\left(\prod_{i\in F} d_i\right)P(W)\left(1 + O\!\left(\frac{1}{n}\right)\right), \tag{7.4.7}$$

and a complete asymptotic expansion could be easily obtained.
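
The generating function approach to the mean can be checked against a brute-force computation on a small case. The sketch below is only an illustration: it assumes a memoryless source given as a dictionary of letter probabilities, encodes the constraint as a list `gaps` (an integer for a finite $d_j$, `None` for $d_j = \infty$), and uses exact rational arithmetic; all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def mean_from_gf(pattern, gaps, probs, n):
    """[z^n] of P(W) z^m * prod_{finite d} (1 + ... + z^(d-1)) / (1 - z)^(b+1)."""
    weight = Fraction(1)
    for symbol in pattern:
        weight *= probs[symbol]
    numerator = [Fraction(0)] * len(pattern) + [weight]  # P(W) z^m
    for d in gaps:
        if d is not None:  # multiply by 1 + z + ... + z^(d-1)
            new = [Fraction(0)] * (len(numerator) + d - 1)
            for i, c in enumerate(numerator):
                for j in range(d):
                    new[i + j] += c
            numerator = new
    b = 1 + sum(1 for d in gaps if d is None)  # number of blocks
    # [z^k] (1 - z)^(-(b+1)) = C(k + b, b)
    return sum(c * comb(n - j + b, b) for j, c in enumerate(numerator) if n - j >= 0)


def mean_brute_force(pattern, gaps, probs, n):
    """Average of the number of constrained occurrences over all texts of length n."""
    m = len(pattern)
    total = Fraction(0)
    for text in product(probs, repeat=n):
        p_text = Fraction(1)
        for symbol in text:
            p_text *= probs[symbol]
        occurrences = sum(
            1
            for pos in combinations(range(n), m)
            if all(text[i] == w for i, w in zip(pos, pattern))
            and all(d is None or pos[j + 1] - pos[j] <= d for j, d in enumerate(gaps))
        )
        total += p_text * occurrences
    return total


if __name__ == "__main__":
    probs = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    print(mean_from_gf("ab", [2], probs, 5), mean_brute_force("ab", [2], probs, 5))
```

The brute-force enumeration is exponential in $n$ and is meant only as a sanity check on short texts.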
For the analysis of variance and especially of higher moments, it is essential
to work with a centered random variable $\Xi_n$ defined, for each $n$, as

$$\Xi_n := \Omega_n - E(\Omega_n) = \sum_{I\in\mathcal{P}_n(\mathcal{D})} Y_I, \tag{7.4.8}$$

where $Y_I := Z_I - E(Z_I) = Z_I - P(W)$. The second moment of the centered
variable $\Xi_n$ equals the variance of $\Omega_n$, and with the centered variables defined
above by (7.4.8), one has

$$E(\Xi_n^2) = \sum_{I,J\in\mathcal{P}_n(\mathcal{D})} E(Y_I Y_J). \tag{7.4.9}$$

From this last equation, we need to analyze pairs of positions $((I, X), (J, X)) =
(I, J, X)$ relative to a common text $X$. We denote by $\mathcal{O}_2$ this set, that is,

$$\mathcal{O}_2 := \{(I, J, X)\ ;\ I, J \in \mathcal{P}_{|X|}(\mathcal{D})\},$$

and we weight each element $(I, J, X)$ by $Y_I(X)Y_J(X)P(X)$. The corresponding
generating function, which enumerates pairs of occurrences, is

$$O_2(z) := \sum_{(I,J,X)\in\mathcal{O}_2} Y_I(X)Y_J(X)P(X)\,z^{|X|} = \sum_{X}\Bigg(\sum_{I,J\in\mathcal{P}_{|X|}(\mathcal{D})} Y_I(X)Y_J(X)\Bigg)P(X)\,z^{|X|},$$

and, with (7.4.9),

$$O_2(z) = \sum_{n\ge 0}\ \sum_{I,J\in\mathcal{P}_n(\mathcal{D})} E(Y_I Y_J)\,z^n = \sum_{n\ge 0} E(\Xi_n^2)\,z^n.$$

The process entirely parallels the derivation of (7.4.3) and (7.4.4), and one has
$[z^n]O_2(z) = E(\Xi_n^2)$, so that $O_2(z)$ serves as the generating function (in the usual
sense) of the sequence of moments $E(\Xi_n^2)$.

There are two kinds of pairs $(I, J)$, depending on whether they intersect or not.
When $I$ and $J$ do not intersect, the corresponding random variables $Y_I$ and
$Y_J$ are independent, and the corresponding covariance $E[Y_I Y_J]$ reduces to 0.
As a consequence, one may restrict attention to pairs of occurrences $I, J$ that
intersect at one place at least. Suppose that there exist two occurrences of
pattern $W$ at positions $I$ and $J$ which intersect at $\ell$ distinct places. We then
denote by $W_{I\cap J}$ the subpattern of $W$ that occurs at position $I \cap J$, and by
$P(W_{I\cap J})$ the probability of this subpattern. Since the expectation $E(Z_I Z_J)$
equals $P(W)^2/P(W_{I\cap J})$ provided that $W$ agrees on every position of $I \cap J$, the
expectation $E(Y_I Y_J) = P(W)^2 e(I, J)$ involves a correlation number $e(I, J)$,

$$e(I, J) = \frac{[\![W \text{ agrees on } I \cap J]\!]}{P(W_{I\cap J})} - 1. \tag{7.4.10}$$

Remark that this relation remains true even if the pair $(I, J)$ is not intersecting,
since, in this case, one has $P(W_{I\cap J}) = P(\varepsilon) = 1$.


Figure 7.3. A pair of position tuples $I, J$ with $b = 6$ blocks each and the
joint aggregates; the number of degrees of freedom is here $\beta(I, J) = 4$.

The asymptotic behavior of the variance is driven by the overlapping of blocks
involved in $I$ and $J$, rather than plainly by the cardinality of $I \cap J$. In order to
formalize this, define first the (joint) aggregate $\alpha(I, J)$ to be the system of blocks
obtained by merging together all intersecting blocks of the two aggregates $\alpha(I)$
and $\alpha(J)$. The number of blocks $\beta(I, J)$ of $\alpha(I, J)$ plays a fundamental role
here, since it measures the degree of freedom of pairs; we also call $\beta(I, J)$ the
degree of the pair $(I, J)$. Figure 7.3 illustrates this notion graphically.


EXAMPLE 7.4.2. Consider the pattern $W = \mathtt{a}\#_3\mathtt{b}\#_4\mathtt{r}\#\mathtt{a}\#_4\mathtt{c}$ composed of
two blocks. Then the text aarbarbccaracc contains several valid occurrences
of $W$ including two at positions $I = (2, 4, 6, 10, 13)$ and $J = (5, 7, 11, 12, 13)$.
The individual aggregates are $\alpha(I) = \{[2, 6], [10, 13]\}$ and $\alpha(J) = \{[5, 11], [12, 13]\}$,
so that the joint quantities are $\alpha(I, J) = [2, 13]$ and $\beta(I, J) = 1$. This pair has
exactly degree 1.


When $I$ and $J$ intersect, there exists at least one block of $\alpha(I)$ that intersects
a block of $\alpha(J)$, so that the degree $\beta(I, J)$ is at most equal to $2b - 1$. Next, we
partition $\mathcal{O}_2$ according to the value of $\beta(I, J)$ and write

$$\mathcal{O}_2^{[p]} := \{(I, J, X) \in \mathcal{O}_2\ ;\ \beta(I, J) = 2b - p\}$$

for the collection of intersecting pairs $(I, J, X)$ of occurrences for which the
degree of freedom equals $2b - p$. From the preceding discussion, only $p \ge 1$
needs to be considered and

$$O_2(z) = O_2^{[1]}(z) + O_2^{[2]}(z) + O_2^{[3]}(z) + \cdots + O_2^{[2b-1]}(z).$$

As we see next, it is only the first term of this sum that matters asymptotically.

In order to conclude the discussion, we need the notion of full pairs: a pair
$(I, J)$ of $\mathcal{P}_q(\mathcal{D}) \times \mathcal{P}_q(\mathcal{D})$ is full if the joint aggregate $\alpha(I, J)$ completely covers
the interval $[1, q]$; see Figure 7.4. (Clearly, the possible values of the length $q$ are
finite in number, since $q$ is at most equal to $2\ell$, where $\ell$ is the length of the constraint $\mathcal{D}$.)


EXAMPLE 7.4.3. Consider the pattern $W = \mathtt{a}\#_3\mathtt{b}\#_4\mathtt{r}\#\mathtt{a}\#_4\mathtt{c}$. The text
aarbarbccaracc also contains two other occurrences of $W$, at positions $I' =
(1, 4, 6, 12, 13)$ and $J' = (5, 7, 11, 12, 14)$. Now, $I'$ and $J'$ are intersecting, and
the aggregates are $\alpha(I') = \{[1, 6], [12, 13]\}$ and $\alpha(J') = \{[5, 11], [12, 14]\}$, so that
$\alpha(I', J') = \{[1, 11], [12, 14]\}$. We have here an example of a full pair of
occurrences with a number of blocks $\beta(I', J') = 2$.
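
The joint aggregate and the fullness test of Examples 7.4.2 and 7.4.3 amount to merging intersecting closed integer segments. The following sketch is an illustration with hypothetical function names; blocks are represented as pairs `(lo, hi)`, and two blocks are merged only if they share at least one position (adjacent blocks such as $[1,11]$ and $[12,14]$ stay separate).

```python
def joint_aggregate(agg_i, agg_j):
    """Merge all intersecting blocks of two aggregates (lists of (lo, hi))."""
    merged = []
    for lo, hi in sorted(agg_i + agg_j):
        if merged and lo <= merged[-1][1]:
            # Shares a position with the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def is_full(blocks, q):
    """True if the blocks, taken in order, cover every position of [1, q]."""
    expected = 1
    for lo, hi in blocks:
        if lo != expected:
            return False
        expected = hi + 1
    return expected == q + 1


if __name__ == "__main__":
    # Example 7.4.2
    print(joint_aggregate([(2, 6), (10, 13)], [(5, 11), (12, 13)]))
    # Example 7.4.3
    pair = joint_aggregate([(1, 6), (12, 13)], [(5, 11), (12, 14)])
    print(pair, is_full(pair, 14))
```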


There is a fundamental translation invariance due to the independence of
symbols in the Bernoulli model that entails a combinatorial isomorphism ($\cong$
represents combinatorial isomorphism)

$$\mathcal{O}_2^{[p]} \cong (\mathcal{A}^*)^{2b-p+1} \times \mathcal{B}_2^{[p]},$$

where $\mathcal{B}_2^{[p]}$ is the subset of $\mathcal{O}_2$ formed of full pairs such that $\beta(I, J)$ equals $2b - p$.
In essence, the gaps can all be grouped together (their number is $2b - p + 1$, which
is translated by the prefactor $(\mathcal{A}^*)^{2b-p+1}$), while what remains constitutes a full
occurrence. The generating function of $\mathcal{O}_2^{[p]}$ is accordingly

$$O_2^{[p]}(z) = \left(\frac{1}{1-z}\right)^{2b-p+1} \times B_2^{[p]}(z),$$

where $B_2^{[p]}(z)$ is the generating function of the collection $\mathcal{B}_2^{[p]}$. From our earlier
discussion, it is a polynomial. Now, an easy dominant pole analysis entails
that $[z^n]O_2^{[p]}(z) = O(n^{2b-p})$. This proves that the dominant contribution to the
variance is given by $[z^n]O_2^{[1]}(z)$, which is of order $O(n^{2b-1})$.

The variance $E(\Xi_n^2)$ involves the constant $B_2^{[1]}(1)$ that is the total weight of
the collection $\mathcal{B}_2^{[1]}$. Recall that this collection is formed of intersecting full pairs
of occurrences of degree $2b - 1$. The polynomial $B_2^{[1]}(z)$ is itself the generating
function of the collection $\mathcal{B}_2^{[1]}$, and it is conceptually an extension of Guibas and
Odlyzko’s autocorrelation polynomial. We shall later make precise the relation
between both polynomials.

We summarize our findings in the following theorem.



Figure 7.4. A full pair of position tuples $I, J$ with $b = 6$ blocks each.

THEOREM 7.4.4. Consider a general constraint $\mathcal{D}$ with a number of blocks
equal to $b$. The mean and the variance of the number of occurrences $\Omega_n$ of a
pattern $W$ subject to constraint $\mathcal{D}$ satisfy

$$E(\Omega_n) = \frac{P(W)}{b!}\Bigg(\prod_{j\,:\,d_j<\infty} d_j\Bigg)\,n^b\,\big(1 + O(n^{-1})\big),$$

$$\mathrm{Var}(\Omega_n) = \sigma^2(W)\,n^{2b-1}\,\big(1 + O(n^{-1})\big),$$

where the “variance coefficient” $\sigma^2(W)$ involves the autocorrelation $\kappa(W)$:

$$\sigma^2(W) = \frac{P^2(W)}{(2b-1)!}\,\kappa^2(W) \quad\text{with}\quad \kappa^2(W) := \sum_{(I,J)\in\mathcal{B}_2^{[1]}} e(I, J). \tag{7.4.11}$$

The set $\mathcal{B}_2^{[1]}$ is the collection of all pairs of position tuples $(I, J)$ that satisfy three
conditions: (i) they are full; (ii) they are intersecting; (iii) there is a single pair
$(r, s)$ with $1 \le r, s \le b$ for which the $r$th block $B^{[r]}$ of $\alpha(I)$ and the $s$th block
$C^{[s]}$ of $\alpha(J)$ intersect.


The computation of the autocorrelation $\kappa(W)$ reduces to $b^2$ computations
of correlations $\kappa(W_r, W_s)$, relative to pairs $(W_r, W_s)$ of blocks. Note that each
correlation of the form $\kappa(W_r, W_s)$ involves a totally constrained problem and is
discussed below. Let $D(\mathcal{D}) := \prod_{i\,:\,d_i<\infty} d_i$. Then, one has

$$\kappa^2(W) = D^2(\mathcal{D}) \sum_{1\le r,s\le b} \frac{1}{D_r D_s}\binom{r+s-2}{r-1}\binom{2b-r-s}{b-r}\,\kappa(W_r, W_s), \tag{7.4.12}$$

where $\kappa(W_r, W_s)$ is the sum of the $e(I, J)$ taken over all full intersecting pairs
$(I, J)$ formed with a position tuple $I$ of block $W_r$ subject to constraint $\mathcal{D}_r$ and
a position tuple $J$ of block $W_s$ subject to constraint $\mathcal{D}_s$. Let us explain the
formula (7.4.12) in words: for a pair $(I, J)$ of the set $\mathcal{B}_2^{[1]}$, there is a single pair
$(r, s)$ of indices with $1 \le r, s \le b$ for which the $r$th block $B^{[r]}$ of $\alpha(I)$ and the $s$th
block $C^{[s]}$ of $\alpha(J)$ intersect. Then, there exist $r + s - 2$ blocks before the block
$\alpha(B^{[r]}, C^{[s]})$ and $2b - r - s$ blocks after it. We then have three different degrees
of freedom: (i) the relative order of blocks $B^{[i]}$ ($i < r$) and blocks $C^{[j]}$ ($j < s$),
and similarly the relative order of blocks $B^{[i]}$ ($i > r$) and blocks $C^{[j]}$ ($j > s$); (ii)
the lengths of the blocks (there are $D_j$ possible lengths for the $j$th block); (iii)
finally the relative positions of the blocks $B^{[r]}$ and $C^{[s]}$.

In particular, in the unconstrained case, the parameter $b$ equals $m$, and each
block $W_r$ is reduced to the symbol $w_r$. Then the “correlation coefficient” $\kappa^2(W)$
simplifies to

$$\kappa^2(W) = \sum_{1\le r,s\le m}\binom{r+s-2}{r-1}\binom{2m-r-s}{m-r}\,[\![w_r = w_s]\!]\left(\frac{1}{P(w_r)} - 1\right). \tag{7.4.13}$$

In words, once you fix the position of the intersection, called the pivot, then amongst
the $r + s - 2$ elements smaller than the pivot one assigns freely $r - 1$ to the first
occurrence and the remaining $s - 1$ to the second. One proceeds similarly for
the $2m - r - s$ elements larger than the pivot.
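
In the unconstrained case the coefficient is a finite double sum and is easy to evaluate. The sketch below is an illustration under stated assumptions: it takes the sum over pairs $(r,s)$ with $w_r = w_s$ of the two binomial factors times $1/P(w_r) - 1$, combines it with the variance coefficient $P^2(W)\kappa^2(W)/(2m-1)!$ of the unconstrained case, and uses hypothetical function names and exact rational probabilities.

```python
from fractions import Fraction
from math import comb, factorial


def kappa2_unconstrained(word, probs):
    """Double sum over pivots (r, s) with w_r == w_s, unconstrained case."""
    m = len(word)
    total = Fraction(0)
    for r in range(1, m + 1):
        for s in range(1, m + 1):
            if word[r - 1] == word[s - 1]:
                orderings = comb(r + s - 2, r - 1) * comb(2 * m - r - s, m - r)
                total += orderings * (1 / Fraction(probs[word[r - 1]]) - 1)
    return total


def sigma2_unconstrained(word, probs):
    """Variance coefficient P(W)^2 * kappa^2(W) / (2m - 1)!."""
    p_word = Fraction(1)
    for symbol in word:
        p_word *= Fraction(probs[symbol])
    m = len(word)
    return p_word ** 2 * kappa2_unconstrained(word, probs) / factorial(2 * m - 1)


if __name__ == "__main__":
    uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    print(kappa2_unconstrained("ab", uniform), sigma2_unconstrained("ab", uniform))
```

For a one-letter pattern the coefficient reduces to $p(1-p)$, the variance per symbol of a binomial count, which gives a quick sanity check.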
7.4.2. Autocorrelation polynomial revisited


Finally, we compare the autocorrelation coefficient $\kappa(W)$ with the
autocorrelation polynomial $S_w(z)$ introduced in the last section for the exact string
matching problem. Let now $w = w_1 w_2 \cdots w_m$ be again a string of length $m$, and let all
the symbols of $w$ occur at consecutive places, so that a valid position $I$ is
an interval of length $m$. We recall that the autocorrelation set $\mathcal{P}(w) \subset [1..m]$
involves all indices $k$ such that the prefix $w_1^k$ coincides with the suffix $w_{m-k+1}^m$.
Here, an index $k \in \mathcal{P}(w)$ is relative to an intersecting pair of positions $(I, J)$ and
one has $w_1^k = w_{I\cap J}$.

In the previous section, we introduced the autocorrelation polynomial $S_w(z)$
as

$$S_w(z) = \sum_{k\in\mathcal{P}(w)} P(w_{k+1}^m)\,z^{m-k} = P(w)\sum_{k\in\mathcal{P}(w)}\frac{1}{P(w_1^k)}\,z^{m-k}.$$

We also define

$$C_w(z) := \sum_{k\in\mathcal{P}(w)} z^{m-k}.$$

Since the polynomial $B_2^{[1]}$ involves the coefficients $e(I, J)$, this polynomial can be
written as a function of the two autocorrelation polynomials $S_w$ and $C_w$,

$$B_2^{[1]}(z) = P(w)\,z^m\,\big[S_w(z) - P(w)\,C_w(z)\big].$$

Put simply, the variance coefficient of the hidden pattern problem extends the
classical autocorrelation quantities associated with strings.


7.4.3. Central limit laws 


Our goal is to prove that the sequence $\Omega_n$, appropriately centered and scaled,
tends to the normal distribution. We consider the following standardized
random variable $\tilde\Xi_n$, which is defined for each $n$ by

$$\tilde\Xi_n := \frac{\Xi_n}{n^{b-1/2}} = \frac{\Omega_n - E(\Omega_n)}{n^{b-1/2}}, \tag{7.4.14}$$

where $b$ is the number of blocks of the constraint $\mathcal{D}$. We shall show that $\tilde\Xi_n$
behaves asymptotically as a normal variable with mean 0 and standard
deviation $\sigma$. By the classical moment convergence theorem this is established once
all moments of $\tilde\Xi_n$ are known to converge to the appropriate moments of the
standard normal distribution.

We remind the reader that if $G$ is a standard normal variable (i.e., a
Gaussian distributed variable with mean 0 and standard deviation 1), then for any
integer $s \ge 0$

$$E(G^{2s}) = 1 \cdot 3 \cdots (2s - 1), \qquad E(G^{2s+1}) = 0. \tag{7.4.15}$$

We shall accordingly distinguish two cases based on the parity of $r$, $r = 2s$ and
$r = 2s + 1$, and prove that

$$E(\Xi_n^{2s+1}) = o\big(n^{(2s+1)(b-1/2)}\big), \qquad E(\Xi_n^{2s}) \sim \sigma^{2s}\,\big(1 \cdot 3 \cdots (2s-1)\big)\,n^{2sb-s}, \tag{7.4.16}$$

which implies Gaussian convergence of $\tilde\Xi_n$.


THEOREM 7.4.5. The random variable $\Omega_n$ over a random text of length $n$
generated by a memoryless source asymptotically obeys a Central Limit Law in
the sense that its distribution is asymptotically normal: for all $x = O(1)$, one
has

$$\lim_{n\to\infty} P\left(\frac{\Omega_n - E(\Omega_n)}{\sqrt{\mathrm{Var}(\Omega_n)}} \le x\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt. \tag{7.4.17}$$


Proof. The proof below is combinatorial; it basically reduces to grouping and
enumerating adequately the various combinations of indices in the sum that
expresses $E(\Xi_n^r)$. Once more, $\mathcal{P}_n(\mathcal{D})$ is formed of all the sets of positions in
$[1, n]$ subject to the constraint $\mathcal{D}$ and we set $\mathcal{P}(\mathcal{D}) := \bigcup_n \mathcal{P}_n(\mathcal{D})$. Then totally
distributing the terms in $\Xi_n^r$ yields

$$E(\Xi_n^r) = \sum_{(I_1,\ldots,I_r)\in\mathcal{P}_n^r(\mathcal{D})} E(Y_{I_1}\cdots Y_{I_r}). \tag{7.4.18}$$

An $r$-tuple of sets $(I_1, \ldots, I_r)$ in $\mathcal{P}^r(\mathcal{D})$ is said to be friendly if each $I_k$ intersects
at least one other $I_\ell$, with $\ell \neq k$, and we let $\mathcal{Q}^{(r)}(\mathcal{D})$ be the set of all friendly
collections in $\mathcal{P}^r(\mathcal{D})$. For $\mathcal{P}^r$, $\mathcal{Q}^{(r)}$, and their derivatives below, we add the
subscript $n$ each time the situation is particularized to texts of length $n$. If
$(I_1, \ldots, I_r)$ does not lie in $\mathcal{Q}^{(r)}(\mathcal{D})$, then $E(Y_{I_1}\cdots Y_{I_r}) = 0$, since at least one
of the $Y_I$’s is independent of the other factors in the product and the $Y_I$’s have
been centered, $E(Y_I) = 0$. One can thus restrict attention to friendly families
and get the basic formula

$$E(\Xi_n^r) = \sum_{(I_1,\ldots,I_r)\in\mathcal{Q}_n^{(r)}(\mathcal{D})} E(Y_{I_1}\cdots Y_{I_r}), \tag{7.4.19}$$

where the expression involves fewer terms than in (7.4.18). From there, we
proceed in two stages. First, we restrict attention to friendly families that give rise
to the dominant contribution and introduce a suitable subfamily $\mathcal{Q}_\star^{(r)} \subset \mathcal{Q}^{(r)}$.
In so doing, moments of odd order appear to be negligible. Next, for even
order $r$, the family $\mathcal{Q}_\star^{(r)}$ involves a symmetry and it suffices to consider another
smaller subfamily $\mathcal{Q}_{\star\star}^{(r)} \subset \mathcal{Q}_\star^{(r)}$ that corresponds to a “standard” form of position
tuple intersection; this last reduction precisely gives rise to the even Gaussian
moments.


Odd moments. Given $(I_1, \ldots, I_r) \in \mathcal{Q}^{(r)}$, the aggregate $\alpha(I_1, I_2, \ldots, I_r)$ is
defined as the aggregation (in the sense of the variance calculation above) of
$\alpha(I_1) \cup \cdots \cup \alpha(I_r)$. Next, the number of blocks of $(I_1, \ldots, I_r)$ is the number
of blocks of the aggregate $\alpha(I_1, \ldots, I_r)$; if $p$ is the total number of intersecting
blocks of the aggregate $\alpha(I_1, \ldots, I_r)$, the aggregate $\alpha(I_1, I_2, \ldots, I_r)$ has $rb - p$
blocks. As previously, we say that the family $(I_1, \ldots, I_r)$ of $\mathcal{Q}_q^{(r)}$ is full if the
aggregate $\alpha(I_1, I_2, \ldots, I_r)$ completely covers the interval $[1, q]$. In this case, the
length of the aggregate is at most $rd(m - 1) + 1$, and the generating function
of full families is a polynomial $P_r(z)$ of degree at most $rd(m - 1) + 1$ with
$d = \max_{j\in F} d_j$. Then, the generating function of families of $\mathcal{Q}^{(r)}$ whose block
number equals $k$ is of the form

$$\left(\frac{1}{1-z}\right)^{k+1} \times P_r(z),$$

so that the number of families of $\mathcal{Q}_n^{(r)}$ whose block number equals $k$ is $O(n^k)$.
This observation proves that the dominant contribution to (7.4.19) arises from
friendly families with a maximal block number. It is clear that the minimum
number of intersecting blocks of any element of $\mathcal{Q}^{(r)}$ equals $\lceil r/2\rceil$, since it
coincides exactly with the minimum number of edges of a graph with $r$ vertices
which contains no isolated vertex. Then the maximum block number of a
friendly family equals $rb - \lceil r/2\rceil$. In view of this fact and the remarks above
regarding cardinalities, we immediately have

$$E(\Xi_n^{2s+1}) = O\big(n^{(2s+1)b-s-1}\big) = o\big(n^{(2s+1)(b-1/2)}\big),$$

which establishes the limit form of odd moments in (7.4.16).


Even moments. We are thus left with estimating the even moments. The
dominant term is relative to friendly families of $\mathcal{Q}^{(2s)}$ with an intersecting block
number equal to $s$, whose set we denote by $\mathcal{Q}_\star^{(2s)}$. In such a family, each subset
$I_k$ intersects one and only one other subset $I_\ell$. Furthermore, if the blocks
of $\alpha(I_h)$ are denoted by $B_h^{[u]}$, $1 \le u \le b$, there exists only one block $B_k^{[u_k]}$
of $\alpha(I_k)$ and only one block $B_\ell^{[u_\ell]}$ of $\alpha(I_\ell)$ that contains the points of $I_k \cap I_\ell$. This
defines an involution $\tau$ such that $\tau(k) = \ell$ and $\tau(\ell) = k$ for all pairs of indices
$(\ell, k)$ for which $I_k$ and $I_\ell$ intersect. Furthermore, given the symmetry of
$E(Y_{I_1}\cdots Y_{I_{2s}})$ under permutations of the indices, it suffices to restrict attention to friendly
families of $\mathcal{Q}_\star^{(2s)}$ for which the involution $\tau$ is the standard one with cycles $(1, 2)$,
$(3, 4)$, etc.; for such “standard” families, whose set is denoted by $\mathcal{Q}_{\star\star}^{(2s)}$, the pairs
that intersect are thus $(I_1, I_2), \ldots, (I_{2s-1}, I_{2s})$. Since the set $\mathcal{K}_{2s}$ of involutions
of $2s$ elements without fixed points has cardinality $K_{2s} = 1 \cdot 3 \cdot 5 \cdots (2s - 1)$, the equality

$$\sum_{\mathcal{Q}_{\star\,n}^{(2s)}} E(Y_{I_1}\cdots Y_{I_{2s}}) = K_{2s} \sum_{\mathcal{Q}_{\star\star\,n}^{(2s)}} E(Y_{I_1}\cdots Y_{I_{2s}}) \tag{7.4.20}$$

entails that we can work now solely with standard families.
The class of position tuples relative to standard families is
$\mathcal{A}^* \times (\mathcal{A}^*)^{2sb-s-1} \times \mathcal{B}_{2s}^{[s]} \times \mathcal{A}^*$; this class involves the collection $\mathcal{B}_{2s}^{[s]}$ of all full friendly $2s$-tuples of
position tuples with a number of blocks equal to $s$. Since $\mathcal{B}_{2s}^{[s]}$ is exactly a shuffle
of $s$ copies of $\mathcal{B}_2^{[1]}$ (as introduced in the study of the variance), the associated
generating function is

$$\left(\frac{1}{1-z}\right)^{2sb-s+1} (2sb - s)!\,\left(\frac{B_2^{[1]}(z)}{(2b-1)!}\right)^{s},$$

where $B_2^{[1]}(z)$ is the already introduced autocorrelation polynomial. Upon
taking coefficients, we obtain the estimate

$$\sum_{\mathcal{Q}_{\star\star\,n}^{(2s)}} E(Y_{I_1}\cdots Y_{I_{2s}}) \sim n^{(2b-1)s}\,\sigma^{2s}. \tag{7.4.21}$$

In view of the formulae (7.4.18), (7.4.19), (7.4.20), and (7.4.21) above, this yields
the estimate of even moments and leads to the second relation of (7.4.16). This
completes the proof of Theorem 7.4.5. □


The even Gaussian moments eventually come out as the number of involu- 
tions, which corresponds to a fundamental asymptotic symmetry present in the 
problem. In this perspective the specialization of the proof to the fully uncon- 
strained case is reminiscent of the derivation of the usual central limit theorem 
(dealing with sums of independent variables) by moments methods. 
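
The combinatorial fact behind this remark, that the number of ways to pair up $2s$ indices equals the Gaussian moment $1 \cdot 3 \cdots (2s-1)$, can be verified by direct enumeration. The sketch below is illustrative only; the function names are hypothetical.

```python
def count_pairings(items):
    """Number of ways to split `items` into unordered pairs
    (equivalently, involutions without fixed points)."""
    if not items:
        return 1
    rest = items[1:]
    # Pair the first item with each possible partner, then recurse on the remainder.
    return sum(count_pairings(rest[:i] + rest[i + 1:]) for i in range(len(rest)))


def gaussian_even_moment(s):
    """E(G^(2s)) = 1 * 3 * ... * (2s - 1) for a standard normal variable G."""
    result = 1
    for odd in range(1, 2 * s, 2):
        result *= odd
    return result


if __name__ == "__main__":
    for s in range(1, 5):
        print(s, count_pairings(list(range(2 * s))), gaussian_even_moment(s))
```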


7.4.4. Limit laws for fully constrained pattern 


In this section, we strengthen our results for fully constrained patterns, in which
all gaps $d_j$ are finite. We set $D = \prod_j d_j$ and $\ell = \sum_j d_j$. Observe that in this
case, we can reduce the subsequence problem to a generalized string matching
problem with the generalized pattern $W$ consisting of all words that satisfy
$(W, \mathcal{D})$. Thus our previous results apply, in particular Theorems 7.3.8 and
7.3.10. This leads to the following result.


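THEOREM 7.4.6. Consider a fully constrained pattern with mean and variance
found in Theorem 7.4.4 for $b = 1$.


(i) The random variable $\Omega_n$ satisfies a Central Limit Law with speed of
convergence $1/\sqrt{n}$:

$$\sup_{x}\left|\,P\left(\frac{\Omega_n - DP(W)n}{\sigma(W)\sqrt{n}} \le x\right) - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt\,\right| = O\!\left(\frac{1}{\sqrt{n}}\right). \tag{7.4.22}$$


(ii) Large deviations from the mean value have exponentially small probability:
there exist a constant $\eta > 0$ and a nonnegative function $I(x)$ defined throughout
$(0, \eta)$ such that $I(x) > 0$ for $x \neq DP(W)$ and

$$\begin{aligned}
\lim_{n\to\infty}\frac{1}{n}\log P\left(\frac{\Omega_n}{n} \le x\right) &= -I(x) \quad\text{if } 0 < x < DP(W),\\
\lim_{n\to\infty}\frac{1}{n}\log P\left(\frac{\Omega_n}{n} \ge x\right) &= -I(x) \quad\text{if } DP(W) < x < \eta,
\end{aligned} \tag{7.4.23}$$

except for at most a finite number of exceptional values of $x$. More precisely,

$$I(x) = -\log\frac{\lambda(\zeta)}{\zeta^{x}} \quad\text{with } \zeta \equiv \zeta(x) \text{ defined by } \frac{\zeta\lambda'(\zeta)}{\lambda(\zeta)} = x, \tag{7.4.24}$$

where $\lambda(u)$ is the largest eigenvalue of the matrix $T(u)$ of the associated de
Bruijn graph constructed for $W = \{v : v = w_1 u_1 w_2 \cdots w_{m-1} u_{m-1} w_m$, where
$u_i \in \mathcal{A}^{<d_i}$, $1 \le i \le m - 1\}$.


(iii) For primitive patterns (cf. Section 7.3.2) a Local Limit Law holds:

$$\sup_{k}\left|\,P(\Omega_n = k) - \frac{1}{\sigma(W)\sqrt{n}}\,\frac{e^{-x(k)^2/2}}{\sqrt{2\pi}}\,\right| = o\!\left(\frac{1}{\sqrt{n}}\right), \tag{7.4.25}$$

where

$$x(k) = \frac{k - DP(W)n}{\sigma(W)\sqrt{n}}$$

for $n \to \infty$.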

EXAMPLE 7.4.7. We motivated our desire to study the subsequence problem
by an example from computer security. Indeed, if one wants to detect “suspicious”
activities (e.g., signatures viewed as subsequences in an audit file), it is
important to set up a threshold in order to avoid false alarms. This problem
can be rephrased as one of finding a threshold $\alpha_0 = \alpha_0(\mathcal{W}; n, \beta)$ such that

$$P(\Omega_n > \alpha_0) \le \beta,$$

for small given $\beta$ (say $\beta = 10^{-5}$). Based on frequencies of letters and the
assumption that a memoryless model is (at least roughly) relevant, one can
estimate the mean value and the standard deviation coefficients $P(\mathcal{W})$, $\sigma(\mathcal{W})$
as discussed above. The Gaussian limits granted by Theorems 7.4.5 and 7.3.8
then reduce the problem to solving an approximate system, which in the (fully)
constrained case reads

$$\alpha_0 = n P(\mathcal{W}) + x_0\, \sigma(\mathcal{W}) \sqrt{n}, \qquad \beta = \frac{1}{\sqrt{2\pi}} \int_{x_0}^{\infty} e^{-t^2/2}\,dt.$$

This system admits the approximate solution

$$\alpha_0 \approx n P(\mathcal{W}) + \sigma(\mathcal{W}) \sqrt{2 n \log(1/\beta)} \tag{7.4.26}$$

for small $\beta$.
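
The threshold computation of Example 7.4.7 can be sketched numerically. The snippet below is only an illustration under stated assumptions: the function names and the sample values of $n$, the mean coefficient, $\sigma$ and $\beta$ are hypothetical, and the "exact" threshold means the solution of the Gaussian system above (found by bisection), not the true distribution of the number of occurrences. It compares that solution with the closed-form approximation (7.4.26).

```python
import math

def gaussian_tail(x):
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gaussian_quantile(beta, lo=0.0, hi=40.0):
    """Solve gaussian_tail(x0) = beta by bisection (the tail decreases in x)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gaussian_tail(mid) > beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def thresholds(n, mean_coeff, sigma, beta):
    """Return (threshold from the Gaussian system, threshold from (7.4.26)).

    mean_coeff and sigma stand for the mean and standard deviation
    coefficients of the pattern; their values are assumed to be known.
    """
    x0 = gaussian_quantile(beta)
    from_system = n * mean_coeff + x0 * sigma * math.sqrt(n)
    from_formula = n * mean_coeff + sigma * math.sqrt(2.0 * n * math.log(1.0 / beta))
    return from_system, from_formula

if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    print(thresholds(n=10**6, mean_coeff=1e-3, sigma=0.05, beta=1e-5))
```

Since $\sqrt{2\log(1/\beta)}$ overestimates the Gaussian quantile, the closed form gives a slightly larger (more conservative) threshold than the system it approximates.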


7.5. Generalized subsequence problem 


In the generalized subsequence problem the pattern is $\mathcal{W} = (\mathcal{W}_1, \ldots, \mathcal{W}_d)$ where
$\mathcal{W}_i$ is a set of strings (a language). We say that the generalized pattern $\mathcal{W}$
occurs in the text $X$ if $X$ contains $\mathcal{W}$ as a subsequence $(w_1, w_2, \ldots, w_d)$ where
$w_i \in \mathcal{W}_i$. An occurrence of the pattern in $X$ is a sequence

$$(u_0, w_1, u_1, \ldots, w_d, u_d)$$

such that $X = u_0 w_1 u_1 \cdots w_d u_d$. We shall study the associated language $\mathcal{L}$ that
can be described as

$$\mathcal{L} = \mathcal{A}^* \times \mathcal{W}_1 \times \mathcal{A}^* \cdots \mathcal{W}_d \times \mathcal{A}^*. \tag{7.5.1}$$

More precisely, an occurrence of $\mathcal{W}$ is a sequence of $d$ disjoint intervals
$I = (I_1, I_2, \ldots, I_d)$ such that $I_j := [k_j^1, k_j^2]$, where $1 \le k_j^1 \le k_j^2 \le n$, is a portion of
the text $X_1^n$ where $w_j \in \mathcal{W}_j$ occurs. We denote by $\mathcal{P}_n := \mathcal{P}_n(\mathcal{W})$ the set of all valid
occurrences $I$. The number of occurrences $\Omega_n$ of $\mathcal{W}$ in the text $X$ of size $n$ is
then

$$\Omega_n = \sum_{I \in \mathcal{P}_n(\mathcal{L})} Z_I, \tag{7.5.2}$$

where $Z_I(X) := [\mathcal{W} \text{ occurs at position } I \text{ in } X]$.
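
To make the definition of an occurrence concrete, here is a minimal counting sketch. It is not part of the analysis: the function name `count_occurrences` and the representation of the pattern as a list of Python sets are assumptions made for illustration, and all words are assumed non-empty. The code counts the sequences of disjoint, ordered intervals at which words from the successive languages occur, using a simple dynamic program over end positions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of a generalized pattern as a subsequence.

    pattern is a list of languages (sets of non-empty words).  An
    occurrence is a choice of disjoint, ordered intervals of the text,
    one per language, each spelling a word of that language.
    """
    n = len(text)
    # ways[p]: number of placements of the languages processed so far
    # whose last interval ends at or before position p (0-based, exclusive).
    ways = [1] * (n + 1)
    for language in pattern:
        ends = [0] * (n + 1)  # ends[q]: placements whose last word ends exactly at q
        for word in set(language):
            m = len(word)
            for start in range(n - m + 1):
                if text.startswith(word, start):
                    ends[start + m] += ways[start]
        # Prefix sums turn "ends exactly at q" into "ends at or before q".
        running = 0
        ways = [0] * (n + 1)
        for q in range(n + 1):
            running += ends[q]
            ways[q] = running
    return ways[n]

if __name__ == "__main__":
    print(count_occurrences("abab", [{"a"}, {"b"}]))
```

With single-word languages of single letters this reduces to counting an ordinary hidden pattern; with $d = 1$ it counts the occurrences of a set of words as in generalized string matching.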

In passing, we observe that the generalized subsequence problem is the most 
general pattern matching considered so far. It contains the exact string match- 
ing (cf. Section 7.2), generalized string matching (cf. Section 7.3), and the 
subsequence pattern matching known also as hidden patterns (cf. Section 7.4). 
In this section we present an analysis of the first two moments of 2,, for the gen- 
eralized subsequence pattern matching problem for dynamic sources discussed 
in Section 7.1. 


7.5.1. Generating operators for dynamic sources 


In Section 7.1 we have introduced a general probabilistic source known as a 
dynamic source. In this section we analyze the generalized subsequence model 
for such sources. 

We start with a brief description of the methodology of generating operators
that are used in the analysis of dynamic sources. We recall from Section 7.1
that the generating operator $\mathbf{G}_w$ is defined as $\mathbf{G}_w[f](t) := |h_w'(t)|\, f \circ h_w(t)$ for
a density function $f$ and a word $w$. In particular, in (7.1.2) we proved that
$P(w) \int_0^1 f(t)\,dt = \int_0^1 \mathbf{G}_w[f](t)\,dt$ for any function $f(t)$, which implies that $P(w)$
is an eigenvalue of the operator $\mathbf{G}_w$. Furthermore, the generating operator for
$w \cdot u$ is $\mathbf{G}_{w \cdot u} = \mathbf{G}_u \circ \mathbf{G}_w$, where $w$ and $u$ are words (cf. (7.1.3)) and $\circ$ is the
composition of operators.

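Consider now a language $\mathcal{B} \subset \mathcal{A}^*$. Its generating operator $\mathbf{B}(z)$ is then
defined as

$$\mathbf{B}(z) := \sum_{w \in \mathcal{B}} z^{|w|}\, \mathbf{G}_w.$$

We observe that the ordinary generating function of a language $\mathcal{B}$ is related to
the generating operators. Indeed,

$$B(z) := \sum_{w \in \mathcal{B}} z^{|w|} P(w) = \sum_{w \in \mathcal{B}} z^{|w|} \int_0^1 \mathbf{G}_w[f](t)\,dt = \int_0^1 \mathbf{B}(z)[f](t)\,dt. \tag{7.5.3}$$

If $\mathbf{B}(z)$ is well defined at $z = 1$, then $\mathbf{B}(1)$ is called the normalized operator of
$\mathcal{B}$. In particular, using (7.1.1) we can compute

$$P(\mathcal{B}) = \sum_{w \in \mathcal{B}} P(w) = \int_0^1 \mathbf{B}(1)[f](t)\,dt.$$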


Furthermore, the operator

$$\mathbf{G} := \sum_{a \in \mathcal{A}} \mathbf{G}_a \tag{7.5.4}$$

is the normalized operator of the alphabet $\mathcal{A}$ and plays a fundamental role in
the analysis.
From the product formula (7.1.3) of the generating operators $\mathbf{G}_w$ we conclude
that unions and Cartesian products of languages translate into sums and
compositions of the associated operators. For instance, the operator associated
with $\mathcal{A}^*$ is

$$(\mathbf{I} - z\mathbf{G})^{-1} := \sum_{i \ge 0} z^i\, \mathbf{G}^i,$$

where $\mathbf{G}^i = \mathbf{G} \circ \mathbf{G}^{i-1}$.

In order to proceed, we must restrict our attention to a class of dynamic
sources called decomposable that satisfy two properties: (i) there exists a unique
positive dominant eigenvalue $\lambda$ and a dominant eigenvector denoted as $\varphi$ (which
is unique under the normalization $\int \varphi(t)\,dt = 1$); (ii) there is a spectral gap
between the dominant eigenvalue and the other eigenvalues. These properties entail
the separation of the operator $\mathbf{G}$ into two parts

$$\mathbf{G} = \lambda \mathbf{P} + \mathbf{N} \tag{7.5.5}$$

such that the operator $\mathbf{P}$ is the projection relative to the dominant eigenvalue
$\lambda$ while $\mathbf{N}$ is the operator relative to the remainder of the spectrum (cf. Section 7.3).
Furthermore (cf. Exercise 7.5.1)

$$\mathbf{P} \circ \mathbf{P} = \mathbf{P}, \tag{7.5.6}$$
$$\mathbf{P} \circ \mathbf{N} = \mathbf{N} \circ \mathbf{P} = 0. \tag{7.5.7}$$

The last property implies that for any $i \ge 1$

$$\mathbf{G}^i = \lambda^i \mathbf{P} + \mathbf{N}^i. \tag{7.5.8}$$

In particular, for the density operator $\mathbf{G}$ the dominant eigenvalue is $\lambda =
P(\mathcal{A}) = 1$ and $\varphi$ is the unique stationary distribution. The function 1 is the left
eigenvector. Then using (7.5.8) we arrive at

$$(\mathbf{I} - z\mathbf{G})^{-1} = \frac{1}{1-z}\, \mathbf{P} + \mathbf{R}(z), \tag{7.5.9}$$

where

$$\mathbf{R}(z) := (\mathbf{I} - z\mathbf{N})^{-1} - \mathbf{P} = \sum_{k \ge 0} z^k (\mathbf{G}^k - \mathbf{P}). \tag{7.5.10}$$
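
The identities (7.5.5) to (7.5.10) can be checked on a finite-dimensional analogue. The sketch below is an assumption-laden illustration, not the operator setting of the text: it replaces the density operator by the column-stochastic matrix of a two-state Markov chain (hypothetical transition probabilities `a` and `b`), so that densities become vectors, integration becomes summation of entries, and the stationary density becomes the stationary vector `phi`.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(X, e):
    n = len(X)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(e):
        result = matmul(result, X)
    return result

def inverse2(X):
    """Inverse of a 2x2 matrix."""
    (p, q), (r, s) = X
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Hypothetical two-state chain; columns of G sum to 1, so G preserves total mass.
a, b = 0.3, 0.2
G = [[1 - a, b], [a, 1 - b]]
phi = [b / (a + b), a / (a + b)]            # stationary vector: G phi = phi
P = [[phi[0], phi[0]], [phi[1], phi[1]]]    # P[g] = phi * (sum of entries of g)
N = [[G[i][j] - P[i][j] for j in range(2)] for i in range(2)]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def quasi_inverse(M, z):
    """(I - z M)^{-1} for a 2x2 matrix M."""
    return inverse2([[IDENTITY[i][j] - z * M[i][j] for j in range(2)]
                     for i in range(2)])

def R(z):
    """Analogue of (7.5.10): (I - z N)^{-1} - P."""
    Q = quasi_inverse(N, z)
    return [[Q[i][j] - P[i][j] for j in range(2)] for i in range(2)]

def decomposition_rhs(z):
    """Right-hand side of the analogue of (7.5.9)."""
    Rz = R(z)
    return [[P[i][j] / (1 - z) + Rz[i][j] for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    print(max_abs_diff(quasi_inverse(G, 0.5), decomposition_rhs(0.5)))
```

Here `N` has the single nonzero eigenvalue $1 - a - b$, which plays the role of the subdominant part of the spectrum.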
Observe that the first part of (7.5.9) has a pole at $z = 1$ and, due to the spectral
gap, the operator $\mathbf{N}$ has spectral radius $\nu < \lambda = 1$. Furthermore, the operator
$\mathbf{R}(z)$ is analytic in $|z| < 1/\nu$ and, again thanks to the existence of the spectral
gap, the series $\mathbf{R}(1)$ is of geometric type. We shall point out below that the
speed of convergence of $\mathbf{R}(z)$ is closely related to the decay of the correlation
between two consecutive symbols. Finally, we list some additional properties of
the operators just introduced (cf. Exercise 7.5.2), true for any function $g(t)$ defined
between 0 and 1:

$$\mathbf{N}[\varphi] = 0, \qquad \mathbf{P}[g](t) = \varphi(t) \int_0^1 g(t')\,dt', \tag{7.5.11}$$

$$\int_0^1 \mathbf{P}[g](t)\,dt = \int_0^1 g(t)\,dt, \qquad \int_0^1 \mathbf{N}[g](t)\,dt = 0, \tag{7.5.12}$$

where $\varphi$ is the stationary density.

The theory built so far allows us, among other things, to define precisely the correlation
between languages in terms of the generating operators. From now on we restrict
our analysis to the so-called nondense languages $\mathcal{B}$ for which the associated
generating operator $\mathbf{B}(z)$ is analytic in a disk of radius larger than 1 (that is,
$|z| < \rho$ for some $\rho > 1$). First, observe that for a
nondense language $\mathcal{B}$, the normalized generating operator $\mathbf{B}$ satisfies

$$\int_0^1 \mathbf{P} \circ \mathbf{B}[g](t)\,dt = P(\mathcal{B}) \left( \int_0^1 g(t')\,dt' \right). \tag{7.5.13}$$


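Let us now define the correlation coefficient between two languages, say $\mathcal{B}$
with the generating operator $\mathbf{B}$ and $\mathcal{C}$ with generating operator $\mathbf{C}$. Two types
of correlations may occur between such languages. If $\mathcal{B}$ and $\mathcal{C}$ do not overlap,
then $\mathcal{B}$ may be before $\mathcal{C}$, or after $\mathcal{C}$. We define the correlation coefficient $c(\mathcal{B}, \mathcal{C})$
(and in an analogous way $c(\mathcal{C}, \mathcal{B})$) as

$$P(\mathcal{B}) P(\mathcal{C})\, c(\mathcal{B}, \mathcal{C}) := \sum_{k \ge 0} \left[ P(\mathcal{B} \times \mathcal{A}^k \times \mathcal{C}) - P(\mathcal{B}) P(\mathcal{C}) \right] = \int_0^1 \mathbf{C} \circ \mathbf{R}(1) \circ \mathbf{B}[\varphi](t)\,dt. \tag{7.5.14}$$

To see this we observe, using (7.5.5)–(7.5.13),

$$\int_0^1 \mathbf{C} \circ \mathbf{R}(1) \circ \mathbf{B}[\varphi](t)\,dt = \int_0^1 \mathbf{C} \circ \Big[ \sum_{k \ge 0} (\mathbf{G}^k - \mathbf{P}) \Big] \circ \mathbf{B}[\varphi](t)\,dt$$
$$= \sum_{k \ge 0} \left( \int_0^1 \mathbf{C} \circ \mathbf{G}^k \circ \mathbf{B}[\varphi](t)\,dt - \int_0^1 \mathbf{C} \circ \mathbf{P} \circ \mathbf{B}[\varphi](t)\,dt \right)$$
$$= \sum_{k \ge 0} \left( P(\mathcal{B} \times \mathcal{A}^k \times \mathcal{C}) - P(\mathcal{B}) P(\mathcal{C}) \right).$$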
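We say that $\mathcal{B}$ and $\mathcal{C}$ overlap if there exist words $b$, $u$ and $c$ such that $u \ne \varepsilon$
and $(bu, uc) \in (\mathcal{B} \times \mathcal{C}) \cup (\mathcal{C} \times \mathcal{B})$. Then we denote by $\mathcal{B} \uparrow \mathcal{C}$ the set of words that
can be obtained by overlapping words from $\mathcal{B}$ and $\mathcal{C}$. The correlation coefficient of
the overlapping languages $\mathcal{B}$ and $\mathcal{C}$ is defined as

$$d(\mathcal{B}, \mathcal{C}) := \frac{P(\mathcal{B} \uparrow \mathcal{C})}{P(\mathcal{B}) P(\mathcal{C})}. \tag{7.5.15}$$

Finally, the total correlation coefficient $m(\mathcal{B}, \mathcal{C})$ between $\mathcal{B}$ and $\mathcal{C}$ is defined as

$$m(\mathcal{B}, \mathcal{C}) = c(\mathcal{B}, \mathcal{C}) + c(\mathcal{C}, \mathcal{B}) + d(\mathcal{B}, \mathcal{C}), \tag{7.5.16}$$

that is,

$$P(\mathcal{B}) P(\mathcal{C})\, m(\mathcal{B}, \mathcal{C}) = P(\mathcal{B} \uparrow \mathcal{C}) + \sum_{k \ge 0} \left[ P(\mathcal{B} \times \mathcal{A}^k \times \mathcal{C}) + P(\mathcal{C} \times \mathcal{A}^k \times \mathcal{B}) - 2 P(\mathcal{B}) P(\mathcal{C}) \right].$$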

We shall need these coefficients in the analysis of the generalized subsequence 
problem for dynamic sources. 


7.5.2. Mean and variance 


In this section we shall derive the mean and the variance of the number of
occurrences $\Omega_n(\mathcal{W})$ of the generalized pattern as a subsequence for a dynamic
source.
We first give a sketch of the forthcoming proof:

• We first describe the generating operator of the language $\mathcal{L}$ defined in
(7.5.1), which we repeat here:

$$\mathcal{L} = \mathcal{A}^* \times \mathcal{W}_1 \times \mathcal{A}^* \cdots \mathcal{W}_d \times \mathcal{A}^*.$$

It turns out that the quasi-inverse operator $(\mathbf{I} - z\mathbf{G})^{-1}$ is involved in such
a generating operator.

• We then decompose the operator with the help of (7.5.9). We obtain
a term related to $(1 - z)^{-1} \mathbf{P}$ that gives the main contribution to the
asymptotics, and another term coming from the operator $\mathbf{R}(z)$.

• We then compute the generating function of $\mathcal{L}$ using (7.5.3).

• Finally, we extract asymptotic behavior from the generating function.


The main finding of this section is summarized in the next theorem. 


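THEOREM 7.5.1. Consider a decomposable dynamical source endowed with the
stationary density $\varphi$ and let $\mathcal{W} = (\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_d)$ be a generalized nondense
pattern.

(i) The expectation $E(\Omega_n)$ of the number of occurrences of the generalized
pattern $\mathcal{W}$ in a text of length $n$ satisfies asymptotically

$$E(\Omega_n) = \binom{n+d}{d} P(\mathcal{W}) + \binom{n+d-1}{d-1} P(\mathcal{W}) \left[ C(\mathcal{W}) - T(\mathcal{W}) \right] + O(n^{d-2}),$$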
where

$$T(\mathcal{W}) = \sum_{i=1}^{d} \sum_{w \in \mathcal{W}_i} |w|\, \frac{P(w)}{P(\mathcal{W}_i)} \tag{7.5.17}$$

is the average length, and the correlation coefficient $C(\mathcal{W})$ is the sum of the
correlations $c(\mathcal{W}_{i-1}, \mathcal{W}_i)$ between consecutive languages $\mathcal{W}_{i-1}$ and $\mathcal{W}_i$, as defined in (7.5.14).
(ii) The variance of $\Omega_n$ is asymptotically equal to

$$\mathrm{Var}(\Omega_n(\mathcal{W})) = \sigma^2(\mathcal{W})\, n^{2d-1} \left( 1 + O(n^{-1}) \right), \tag{7.5.18}$$
where the coefficient 


3?(W) = P2(W) ae m(W) | 


d!(d — 1)! (2d — 1)! 
and the total correlation—coefficient m(W) can be computed as 
= i+9-—2\ (2d-i-j iets 
m(W) := 2D ( 4 ) ( a_i mM). 
1<i,j<d 
where m(W,, Wi+1) are defined in (7.5.16). 


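Proof. We only prove part (i), leaving the proof of part (ii) as an exercise (cf.
Exercise 7.5.3). We start with the language representation $\mathcal{L}$ defined in
(7.5.1) that we recalled above. Its generating operator is

$$\mathbf{L}(z) = (\mathbf{I} - z\mathbf{G})^{-1} \circ \mathbf{L}_1(z) \circ (\mathbf{I} - z\mathbf{G})^{-1} \circ \cdots \circ \mathbf{L}_d(z) \circ (\mathbf{I} - z\mathbf{G})^{-1}. \tag{7.5.19}$$

After applying the transformation (7.5.9) to $\mathbf{L}(z)$, we obtain an operator

$$\mathbf{M}_1(z) = \left( \frac{1}{1-z} \right)^{d+1} \mathbf{P} \circ \mathbf{L}_1(z) \circ \mathbf{P} \circ \cdots \circ \mathbf{P} \circ \mathbf{L}_d(z) \circ \mathbf{P}$$

that has a pole of order $d + 1$ at $z = 1$. Near $z = 1$, each operator $\mathbf{L}_i(z)$ is
analytic and admits the expansion

$$\mathbf{L}_i(z) = \mathbf{L}_i + (z - 1)\, \mathbf{L}_i'(1) + O\big((z - 1)^2\big).$$

Therefore, the leading term of the expansion is

$$\left( \frac{1}{1-z} \right)^{d+1} \mathbf{P} \circ \mathbf{L}_1 \circ \mathbf{P} \circ \cdots \circ \mathbf{P} \circ \mathbf{L}_d \circ \mathbf{P}. \tag{7.5.20}$$

The second main term is a sum of $d$ terms, each of them obtained by replacing
the operator $\mathbf{L}_i(z)$ by its derivative $\mathbf{L}_i'(1)$ at $z = 1$. The corresponding
generating function $M_1(z)$ satisfies near $z = 1$

$$M_1(z) = \left( \frac{1}{1-z} \right)^{d+1} P(\mathcal{W}) - \left( \frac{1}{1-z} \right)^{d} P(\mathcal{W})\, T(\mathcal{W}) + O\left( \left( \frac{1}{1-z} \right)^{d-1} \right),$$

where the average length $T(\mathcal{W})$ is defined in (7.5.17).

After applying (7.5.9) in $\mathbf{L}(z)$, we also obtain an operator $\mathbf{M}_2(z)$ that has a pole
of order $d$ at $z = 1$. This is a sum of $d + 1$ terms, each of them containing an
occurrence of the operator $\mathbf{R}(z)$ between two generating operators of languages
$\mathcal{W}_{i-1}$, $\mathcal{W}_i$. The corresponding generating function $M_2(z)$ also has a pole of order
$d$ at $z = 1$ and satisfies near $z = 1$

$$M_2(z) = \left( \frac{1}{1-z} \right)^{d} P(\mathcal{W}) \sum_{i=2}^{d} c(\mathcal{W}_{i-1}, \mathcal{W}_i) + O\left( \left( \frac{1}{1-z} \right)^{d-1} \right).$$

Here, the correlation number $c(\mathcal{B}, \mathcal{C})$ between $\mathcal{B}$ and $\mathcal{C}$ is defined in (7.5.14). To
complete the proof we need only to extract the coefficients of $P(z)/(1 - z)^d$, as
already discussed in previous sections.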


7.6. Self-repetitive pattern matching 


In this last section of the chapter, we change the model. So far we postulated 
the pattern w is given. Hereafter, we make the pattern part of the text, which 
is still randomly generated. To simplify our presentation, we assume that the 
text is emitted by a memoryless source. We should point out that the quantity 
analyzed here is in fact the typical depth in a (compact) suffix trie built over 
the suffixes of a randomly generated text. 


7.6.1. Formulation of the problem 


Let $i$ be an arbitrary integer smaller than or equal to $n$. We define $D_n(i)$ to
be the largest value of $k \le n$ such that $X_i^{i+k-1}$ occurs at least twice in the
text $X_1^n$ of length $n$; in other words, such that $N_n(X_i^{i+k-1}) \ge 2$. We recall
that $N_n(w)$ is the number of times the pattern $w$ occurs in the text $X_1^n$. Clearly,
$N_n(X_i^{i+k-1}) \ge 1$. Our goal is to determine the probabilistic behavior of a “typical”
$D_n(i)$, that is, we define $D_n$ to be equal to $D_n(i)$ when $i$ is randomly and
uniformly selected between 1 and $n$. More precisely,

$$P(D_n \le k) = \frac{1}{n} \sum_{i=1}^{n} P(D_n(i) \le k)$$

for any $1 \le k \le n$.
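
The definition of the depth at position $i$ and of the typical depth can be made concrete with a brute-force sketch. The function names are hypothetical, positions are 1-indexed as in the text, occurrences may overlap, and no attempt is made at efficiency (a suffix tree would be the practical tool).

```python
from collections import Counter
from fractions import Fraction

def count_overlapping(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(text.startswith(word, s) for s in range(len(text) - len(word) + 1))

def depth(text, i):
    """Largest k such that the length-k factor starting at position i
    (1-indexed) occurs at least twice in text; 0 if there is none."""
    n = len(text)
    best = 0
    for k in range(1, n - i + 2):
        if count_overlapping(text, text[i - 1:i - 1 + k]) >= 2:
            best = k
        else:
            break  # if a factor occurs only once, so does every extension of it
    return best

def typical_depth_distribution(text):
    """Distribution of the depth when i is uniform on 1..n."""
    n = len(text)
    counts = Counter(depth(text, i) for i in range(1, n + 1))
    return {k: Fraction(c, n) for k, c in counts.items()}

if __name__ == "__main__":
    print([depth("abaab", i) for i in range(1, 6)])
    print(typical_depth_distribution("abaab"))
```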
Let $w \in \mathcal{A}^k$ be an arbitrary word of size $k$. Observe that

$$P(D_n(i) \ge k \ \&\ X_i^{i+k-1} = w) = P(N_n(w) \ge 2 \ \&\ X_i^{i+k-1} = w),$$

and

$$\sum_{i=1}^{n} P(N_n(w) = r \ \&\ X_i^{i+k-1} = w) = r\, P(N_n(w) = r).$$
Recall that $N_n(u) = E(u^{N_n(w)}) = \sum_{r \ge 0} P(N_n(w) = r)\, u^r$ is the probability
generating function of $N_n(w)$. We shall sometimes write $N_{n,w}(u)$ to underline
the fact that the pattern $w$ is given. From the above we conclude that

$$P(D_n \ge k) = \frac{1}{n} \sum_{i=1}^{n} P(D_n(i) \ge k)$$
$$= \frac{1}{n} \sum_{w \in \mathcal{A}^k} \sum_{i=1}^{n} P(D_n(i) \ge k \ \&\ X_i^{i+k-1} = w)$$
$$= \frac{1}{n} \sum_{w \in \mathcal{A}^k} \sum_{r \ge 2} r\, P(N_n(w) = r)$$
$$= \frac{1}{n} \sum_{w \in \mathcal{A}^k} \left( E(N_n(w)) - N_{n,w}'(0) \right)$$
$$= 1 - \frac{1}{n} \sum_{w \in \mathcal{A}^k} N_{n,w}'(0),$$

where $N_{n,w}'(0)$ denotes the derivative of $N_{n,w}(u)$ at $u = 0$.
Let now $D_n(u) = E(u^{D_n}) = \sum_k P(D_n = k)\, u^k$ be the probability generating
function of $D_n$. Then the above implies

$$D_n(u) = \frac{1}{n}\, \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} u^{|w|}\, N_{n,w}'(0),$$

and the bivariate generating function $D(z, u) = \sum_n n D_n(u) z^n$ becomes

$$D(z, u) = \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} u^{|w|}\, \frac{\partial}{\partial u} N_w(z, 0), \tag{7.6.1}$$

where $N_w(z, u) = \sum_{n \ge 0} \sum_{r \ge 0} P(N_n(w) = r)\, z^n u^r$. In Section 7.2 we worked
with $\sum_{n \ge 0} \sum_{r \ge 1} P(N_n(w) = r)\, z^n u^r$ and in (7.2.20) of Theorem 7.2.7 we provided
a formula for it. Adding the term $N_0(z) = S_w(z)/D_w(z)$ we finally arrive
at

$$N_w(z, u) = \frac{z^{|w|} P(w)}{D_w^2(z)}\, \frac{u}{1 - u M_w(z)} + \frac{S_w(z)}{D_w(z)},$$

where $M_w(z)$ is defined in (7.2.21) and $D_w(z) = (1 - z) S_w(z) + z^{|w|} P(w)$ (cf.
(7.2.24)), with $S_w(z)$ being the autocorrelation polynomial for $w$. Since

$$\frac{\partial}{\partial u} N_w(z, 0) = z^{|w|}\, \frac{P(w)}{D_w^2(z)},$$

we finally arrive at the following lemma, which is the starting point of the subsequent
analysis.
LEMMA 7.6.1. The bivariate generating function for $D_n$ is

$$D(z, u) = \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} (zu)^{|w|}\, \frac{P(w)}{D_w^2(z)} \tag{7.6.2}$$

for $|u| < 1$ and $|z| < 1$, where $D_w(z) = (1 - z) S_w(z) + z^{|w|} P(w)$ and $S_w(z)$ is the
autocorrelation polynomial for $w$.


In this section, we prove the following result for a random text generated
by a memoryless source over a finite alphabet $\mathcal{A}$ of size $V$, with $p_i$ being the
probability of emitting symbol $i \in \mathcal{A}$. We denote by $h = -\sum_{i=1}^{V} p_i \log p_i$ the
entropy rate of the source, and $h_2 = \sum_{i=1}^{V} p_i \log^2 p_i$. The reader is asked in
Exercise 7.6.1 to extend the theorem below to Markov sources.


THEOREM 7.6.2. (i) For a biased memoryless source (i.e., $p_i \ne p_j$ for some
$i \ne j$) and any $\varepsilon > 0$

$$E(D_n) = \frac{1}{h} \log n + \frac{\gamma}{h} + \frac{h_2}{2h^2} + P_1(\log n) + O(n^{-\varepsilon}), \tag{7.6.3}$$

$$\mathrm{Var}(D_n) = \frac{h_2 - h^2}{h^3} \log n + O(1), \tag{7.6.4}$$

where $P_1(\cdot)$ is a periodic function with small amplitude in the case where the
tuple $(\log p_1, \ldots, \log p_V)$ is collinear with a rational tuple (i.e., $\log p_j / \log p_1 = r/s$ for some
integers $r$ and $s$) and converges to zero otherwise.

Furthermore, $(D_n - E(D_n))/\sqrt{\mathrm{Var}(D_n)}$ is asymptotically normal with mean zero
and variance one, that is, for fixed $x \in \mathbb{R}$

$$\lim_{n \to \infty} P\{ D_n \le E(D_n) + x \sqrt{\mathrm{Var}(D_n)} \} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt,$$

and for all integers $m$

$$\lim_{n \to \infty} E\left[ \left( \frac{D_n - E(D_n)}{\sqrt{\mathrm{Var}\, D_n}} \right)^{m} \right] = \begin{cases} 0 & \text{when } m \text{ is odd}, \\ \dfrac{m!}{2^{m/2} (m/2)!} & \text{when } m \text{ is even}. \end{cases}$$

(ii) For the unbiased source (i.e., $p_1 = \cdots = p_V = 1/V$), $h_2 = h^2$, the expected
value $E(D_n)$ is given by (7.6.3) above, and for any $\varepsilon > 0$

$$\mathrm{Var}(D_n) = \frac{\pi^2}{6 \log^2 V} + \frac{1}{12} + P_2(\log n) + O(n^{-\varepsilon}),$$

where $P_2(\log n)$ is a periodic function with small amplitude. The limiting distribution
of $D_n$ does not exist, but one finds

$$\lim_{n \to \infty} \sup_x \left| P(D_n \le x) - \exp(-n V^{-x}) \right| = 0$$

for any fixed real $x$.
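
The constants governing the leading terms can be evaluated directly. The sketch below is illustrative only: the function names and the sample distributions are assumptions, and only the leading coefficients $1/h$ and $(h_2 - h^2)/h^3$ (the constants called $c_1$ and $c_2$ in the proof of the theorem) are computed, without the constant and oscillating terms.

```python
import math

def entropy_constants(probs):
    """Return (h, h2) with h = -sum p log p and h2 = sum p log^2 p."""
    h = -sum(p * math.log(p) for p in probs)
    h2 = sum(p * math.log(p) ** 2 for p in probs)
    return h, h2

def leading_coefficients(probs):
    """Coefficients of log n in the mean and in the variance."""
    h, h2 = entropy_constants(probs)
    return 1.0 / h, (h2 - h * h) / h ** 3

if __name__ == "__main__":
    for probs in ([0.7, 0.3], [0.5, 0.5]):
        c1, c2 = leading_coefficients(probs)
        print(probs, c1, c2)
```

For the unbiased source the second coefficient vanishes (up to rounding), in line with the bounded variance of case (ii); for a biased source it is strictly positive, since $h_2 - h^2$ is the variance of $-\log p_X$ for a random symbol $X$.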
In passing we observe that the quantity $D_n$ is also the depth of a randomly
selected suffix in a compact suffix trie. Such a trie is a compacted version of
the suffix tries defined in Chapter 2. In a compact suffix trie one deletes all unary
nodes at the bottom of the non-compact suffix trie. Observe that in a compact
suffix trie, which we further call simply a suffix trie, the path from the root to
node $i$ (representing the $i$th suffix) is the shortest prefix of that suffix that distinguishes it
from all other suffixes. The quantity $D_n(i)$ defined above represents the depth
of the $i$th suffix in the associated suffix trie, while $D_n$ is the typical depth, that is,
the depth of a randomly selected terminal node in the suffix trie. Theorem 7.6.2
tells us that the typical depth is normally distributed with the average depth
asymptotically equal to $\frac{1}{h} \log n$ and variance $O(\log n)$ for a biased memoryless
source. In the unbiased case the variance is $O(1)$ and the (asymptotic) distribution
is of the extreme distribution type. Interestingly, as proved below, the depth
in a suffix trie (built over one sequence generated by a memoryless source)
is asymptotically equivalent to the depth in a trie built over $n$ independently
generated strings. Thus suffix tries resemble tries!


7.6.2. Random tries resemble suffix tries 


The proof of Theorem 7.6.2 hinges on establishing asymptotic equivalence between
$D_n$ introduced above and a new random variable $D_n^T$ defined as follows.
First, for $n$ independently generated texts (by the same memoryless source as for
$D_n$) we denote by $D_n^T(i)$, for an integer $i \le n$, the length of the longest prefix of
the $i$th text that is also a prefix of another text, say the $j$th text, $j \ne i$. Then the
random variable $D_n^T$ is defined by selecting the integer $i$ uniformly between 1 and $n$.
We also define $D_n^T(u) = \sum_k P(D_n^T = k)\, u^k$ and $D^T(z, u) = \sum_n n D_n^T(u) z^n$. Observe
that $D_n^T$ is in fact the typical depth in a trie built over these $n$ independent
texts.

It is relatively easy to derive the generating function of $D_n^T$, as shown below.

LEMMA 7.6.3. For all $n \ge 1$

$$D_n^T(u) = \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} u^{|w|}\, P(w) (1 - P(w))^{n-1}$$

and

$$D^T(z, u) = \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} u^{|w|}\, P(w)\, \frac{z}{(1 - z + P(w) z)^2}$$

for all $|u| < 1$ and $|z| < 1$.


Proof. It suffices to observe that

$$P(D_n^T(i) < k) = \sum_{w \in \mathcal{A}^k} P(w) (1 - P(w))^{n-1}.$$

Indeed, $D_n^T(i) < k$ if there is a word $w \in \mathcal{A}^k$ such that a prefix of the $i$th string
is equal to $w$ and none of the other text prefixes are equal to $w$.
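
The sum in the proof can be evaluated exactly for small parameters. The following sketch is an illustration under assumptions: the function names are hypothetical, the source is memoryless, and the second function is specific to a binary alphabet, where words with the same number of occurrences of the first letter share the same probability and can be grouped with a binomial coefficient.

```python
from itertools import product
from math import comb, prod

def cdf_bruteforce(k, n, probs):
    """P(depth < k): sum over all words w of length k of P(w)(1 - P(w))^(n-1).

    probs is the list of symbol probabilities of a memoryless source.
    """
    total = 0.0
    for letter_probs in product(probs, repeat=k):
        pw = prod(letter_probs)          # probability of the word
        total += pw * (1.0 - pw) ** (n - 1)
    return total

def cdf_binary(k, n, p):
    """Same quantity for a binary alphabet with probabilities p and 1 - p."""
    q = 1.0 - p
    total = 0.0
    for j in range(k + 1):
        pw = p ** j * q ** (k - j)
        total += comb(k, j) * pw * (1.0 - pw) ** (n - 1)
    return total

if __name__ == "__main__":
    print(cdf_bruteforce(6, 10, [0.7, 0.3]), cdf_binary(6, 10, 0.7))
```

The grouped form runs in time linear in $k$ and makes it easy to tabulate the distribution of the trie depth for moderate $n$.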
Our goal now is to prove that $D_n(u)$ and $D_n^T(u)$ are asymptotically close as
$n \to \infty$. This requires several preparatory steps outlined below that will lead
to

$$D_n^T(u) - D_n(u) = (1 - u)\, O(n^{-\varepsilon}) \tag{7.6.5}$$

for some $\varepsilon > 0$ and all $|u| < \beta$ for $\beta > 1$. Consequently,

$$|P(D_n \le k) - P(D_n^T \le k)| = O(n^{-\varepsilon} \beta^{-k})$$

for all positive integers $k$. In Lemma 7.6.11 we shall prove that $D_n^T$ is asymptotically
normal, hence so is $D_n$. This will prove Theorem 7.6.2.

We start with a lemma indicating that for most words $w$ the autocorrelation
polynomial $S_w(z)$ is very close to 1 for $z$ non-negative. This lemma provides
information about analytical properties of the autocorrelation polynomial.


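LEMMA 7.6.4. There exist $\delta < 1$, $\theta > 0$ and $\rho > 1$ such that $\rho\delta < 1$ and

$$\sum_{w \in \mathcal{A}^k} [\![\, |S_w(\rho) - 1| \le (\rho\delta)^k \theta \,]\!]\, P(w) \ge 1 - \theta \delta^k. \tag{7.6.6}$$

Proof. To simplify notation, let $P_k$ be the probability measure on $\mathcal{A}^k$ such
that $P_k(A) = \sum_{w \in \mathcal{A}^k} [\![ w \in A ]\!]\, P(w)$. Thus we need to prove that $P_k(S_w(\rho) \le
1 + (\rho\delta)^k \theta) \ge 1 - \theta \delta^k$.

Let $i$ be an integer smaller than $k$ (recall that $k \in \mathcal{P}(w)$, where $\mathcal{P}(w)$ is the autocorrelation
set for $w$). It is easy to see that (cf. Exercise 7.6.2)

$$P_k(k - i \in \mathcal{P}(w)) = \left( \sum_{j=1}^{V} p_j^{\lfloor k/i \rfloor + 1} \right)^{r} \left( \sum_{j=1}^{V} p_j^{\lfloor k/i \rfloor} \right)^{i - r}, \tag{7.6.7}$$

where $r = k - \lfloor k/i \rfloor\, i$. Denoting $p = \max_i p_i$ we have

$$P_k(k - i \in \mathcal{P}(w)) \le p^{k-i}.$$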
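
Formula (7.6.7) and the bound that follows can be verified by exhaustive enumeration for small $k$. The sketch below is only a check under assumptions: the function names are hypothetical, and the autocorrelation set is taken to be the set of lengths $j$, $1 \le j \le |w|$, for which the prefix and the suffix of $w$ of length $j$ coincide, which is the usual definition of Section 7.2 restated here as an assumption.

```python
from itertools import product

def autocorrelation_set(w):
    """Lengths j in 1..|w| such that the prefix and suffix of length j agree."""
    k = len(w)
    return {j for j in range(1, k + 1) if w[:j] == w[k - j:]}

def word_probability(w, probs):
    """Probability of w under a memoryless source (probs maps symbol -> prob)."""
    p = 1.0
    for c in w:
        p *= probs[c]
    return p

def overlap_probability(k, i, probs):
    """P_k(k - i in autocorrelation set), by enumerating all words of length k."""
    total = 0.0
    for letters in product(probs, repeat=k):
        w = "".join(letters)
        if (k - i) in autocorrelation_set(w):
            total += word_probability(w, probs)
    return total

def overlap_formula(k, i, probs):
    """Closed form (7.6.7): i residue classes, r of them with q + 1 positions."""
    q, r = divmod(k, i)
    long_classes = sum(p ** (q + 1) for p in probs.values())
    short_classes = sum(p ** q for p in probs.values())
    return long_classes ** r * short_classes ** (i - r)

if __name__ == "__main__":
    probs = {"a": 0.7, "b": 0.3}
    print(overlap_probability(6, 4, probs), overlap_formula(6, 4, probs))
```

The closed form reflects the fact that $k - i$ belongs to the autocorrelation set exactly when $w$ has period $i$, so that all positions in the same residue class modulo $i$ carry the same symbol.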


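Thus

$$P_k\big(\max(\mathcal{P}(w) - \{k\}) > k/2\big) \le \sum_{i=1}^{\lfloor k/2 \rfloor} P_k(k - i \in \mathcal{P}(w)) \le \frac{p^{k/2}}{1 - p}.$$

Now, if the word $w$ is such that $\max(\mathcal{P}(w) - \{k\}) \le k/2$, then

$$S_w(\rho) - 1 \le \sum_{i \ge k/2} \rho^i p^i \le \rho^k\, \frac{p^{k/2}}{1 - p}.$$

Therefore, it suffices for (7.6.6) to select $\delta = \sqrt{p}$, $\theta = (1 - p)^{-1}$ and
$\rho > 1$ such that $\rho\delta < 1$.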


In the next lemma we show that $D(z, u)$ can be analytically continued above
the unit disk, that is, for $|u| > 1$.

LEMMA 7.6.5. The generating function $D(z, u)$ can be analytically continued
for all $|u| < \delta^{-1}$ and $|z| < 1$, where $\delta < 1$.

Proof. Let $|u| < 1$ and $|z| < 1$. Consider the following identity

$$\sum_{w \in \mathcal{A}^*} (uz)^{|w|}\, \frac{P(w)}{(1 - z)^2} = \frac{1}{(1 - uz)(1 - z)^2}.$$

Therefore, for $|z| < 1$

$$D(z, u) - \frac{1 - u}{u (1 - uz)(1 - z)^2} = \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} (uz)^{|w|} P(w) \left( \frac{1}{D_w^2(z)} - \frac{1}{(1 - z)^2} \right)$$
$$= (u - 1) \sum_{w \in \mathcal{A}^*} u^{|w|-1} z^{|w|} P(w)\, \frac{\big(D_w(z) - (1 - z)\big)\big(D_w(z) + (1 - z)\big)}{D_w^2(z) (1 - z)^2},$$

where we recall $D_w(z) - (1 - z) = (1 - z)(S_w(z) - 1) + P(w) z^{|w|}$. By Lemma 7.6.4

$$P_k\big( |D_w(z) - (1 - z)| \le (|1 - z| + 1)\, \delta^{|w|} \big) \ge 1 - O(\delta^{|w|})$$

for all $w$ such that $|w| = k$. Moreover, for any bounded function $f(w)$ such that
$f(w) \le f_{\max}$ for all $w$ with $|w| = k$, we also have the following estimate for all
$y$:

$$\sum_{|w| = k} P(w) f(w) \le y + f_{\max} P_k(f(w) > y). \tag{7.6.8}$$

In particular, we take $f(w) = D_w(z) - (1 - z)$ and we have $f_{\max} = O(1)$ since
$|S_w(z)| \le (1 - p)^{-1}$ ($p$ as defined in the proof of Lemma 7.6.4). Now taking
$y = (|1 - z| + 1)\, \delta^k$ and using the above we obtain

nD@ah= See = (w—1) Y(eu)*O((1 = 2] + 1)5" + 5%) 


for all w. In conclusion, 


(1—u) _ u—1 
we") ~ Gea ap =O (Fag) 


for 6 < 1 and |z| < 1, which completes the proof. 1: 


Before we proceed, we need two technical lemmas. 


LEMMA 7.6.6. There exist $K$, a constant $\rho' > 1$ and $\alpha > 0$ such that for all
$w$ with $|w| \ge K$ we have

$$|S_w(z)| \ge \alpha$$

for $|z| \le \rho'$, with $\rho' > 1$ such that $p\rho' < 1$.


Proof. Let € be an integer and p’ > 1 such that pp’ + (pp')’ < 1. Let k > @ and 
let w such that |w| > k. Let 1 = max(P — {k}). Ifa < 2, then for all z such that 
|z| < p’ we have 
1\e 
ISu(2)| = 1 — PPI 
1 = pp’ 
If $i > \ell$, let $q = \lfloor k/i \rfloor$; then $w = u^q v$, where $u$ is the prefix of length $i$ of the word $w$,
and $v$ is the suffix of length $k - iq$. Thus


Sole) = SEF + (Plu)zt)* Bl) 


where Si,,(z) is the autocorrelation polynomial of wv. This implies 


1—(pp')® — (pp')4 — (pp')* 


But since 7 > @, we obtain 


aees)s Leste 


0 
- 1+ pp’ a 


which completes the proof. rT 


LEMMA 7.6.7. There exists an integer $K'$ such that for $|w| \ge K'$ there is only
one root of $D_w(z)$ in the disk $|z| \le \rho'$ for $\rho' > 1$.

Proof. Let $K_1$ be such that $(p\rho')^{K_1} < \alpha(\rho' - 1)$ holds for the $\alpha$ and $\rho'$ as in
Lemma 7.6.6. Denote $K' = \max\{K, K_1\}$, where $K$ is defined above. Note also
that the above condition implies that for all $w$ such that $|w| = k > K'$ we have
$P(w)(\rho')^k < \alpha(\rho' - 1)$. Hence, for $|w| > K'$ we have $|P(w) z^k| < |(z - 1) S_w(z)|$ on
the circle $|z| = \rho' > 1$. Therefore, by Rouché's theorem the polynomial $D_w(z)$
has the same number of roots as $(1 - z) S_w(z)$ in the disk $|z| \le \rho'$. But the
polynomial $(1 - z) S_w(z)$ has only a single root in this disk, since by Lemma 7.6.6
we have $|S_w(z)| > 0$ in $|z| \le \rho'$.

We have just established that there exists a smallest root of $D_w(z) = 0$, which we
denote as $A_w$. Let also $C_w$ and $E_w$ be the first and the second derivatives of
$D_w(z)$ at $z = A_w$, respectively. Using bootstrapping, one easily obtains the
following expansions (with $k = |w|$)

$$A_w = 1 + \frac{1}{S_w(1)} P(w) + O(P(w)^2),$$

$$C_w = -S_w(1) + \left( k - \frac{2 S_w'(1)}{S_w(1)} \right) P(w) + O(P(w)^2),$$

$$E_w = -2 S_w'(1) + \left( k(k-1) - \frac{3 S_w''(1)}{S_w(1)} \right) P(w) + O(P(w)^2),$$

where $S_w'(1)$ and $S_w''(1)$, respectively, denote the first and the second derivatives
of $S_w(z)$ at $z = 1$.
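
The first-order expansion of the root can be compared with a numerical root. The sketch below is an illustration under assumptions: the function names are hypothetical, the source is memoryless, the autocorrelation polynomial is taken in its usual form (one term $P(w_{j+1}^k) z^{k-j}$ for each $j$ in the autocorrelation set, restated here as an assumption since its definition lies outside this section), and the root is located by Newton's method started at $z = 1$.

```python
def word_probability(w, probs):
    p = 1.0
    for c in w:
        p *= probs[c]
    return p

def autocorrelation_coeffs(w, probs):
    """Coefficients s[m] of z^m in the autocorrelation polynomial of w."""
    k = len(w)
    coeffs = [0.0] * k
    for j in range(1, k + 1):
        if w[:j] == w[k - j:]:                  # j is in the autocorrelation set
            coeffs[k - j] += word_probability(w[j:], probs)
    return coeffs

def d_coeffs(w, probs):
    """Coefficients of D_w(z) = (1 - z) S_w(z) + z^|w| P(w)."""
    k = len(w)
    d = [0.0] * (k + 1)
    for m, c in enumerate(autocorrelation_coeffs(w, probs)):
        d[m] += c
        d[m + 1] -= c
    d[k] += word_probability(w, probs)
    return d

def poly_eval(coeffs, z):
    value = 0.0
    for c in reversed(coeffs):                  # Horner's rule
        value = value * z + c
    return value

def poly_derivative(coeffs):
    return [m * c for m, c in enumerate(coeffs)][1:]

def smallest_root(w, probs, iterations=50):
    """Newton iteration for the root of D_w near 1, started at z = 1."""
    d = d_coeffs(w, probs)
    dp = poly_derivative(d)
    z = 1.0
    for _ in range(iterations):
        z -= poly_eval(d, z) / poly_eval(dp, z)
    return z

def first_order_root(w, probs):
    """1 + P(w) / S_w(1), the leading terms of the expansion of the root."""
    return 1.0 + word_probability(w, probs) / sum(autocorrelation_coeffs(w, probs))

if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.5}
    print(smallest_root("abaab", probs), first_order_root("abaab", probs))
```

For the sample word the two values differ by a quantity of the order of $P(w)^2$, as the expansion predicts.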

Finally, we are ready to compare $D_n(u)$ with $D_n^T(u)$ to conclude that they
do not differ too much as $n \to \infty$. Let us define two new generating functions
$Q_n(u)$ and $Q(z, u)$ that represent the difference between $D_n(u)$ and $D_n^T(u)$, that
is,

$$Q_n(u) = \frac{u}{1 - u} \left( D_n(u) - D_n^T(u) \right)$$

and

$$Q(z, u) = \sum_{n \ge 0} n\, Q_n(u) z^n = \frac{u}{1 - u} \left( D(z, u) - D^T(z, u) \right).$$

Then

$$Q(z, u) = \sum_{w \in \mathcal{A}^*} u^{|w|} P(w) \left( \frac{z^{|w|}}{D_w^2(z)} - \frac{z}{(1 - z + P(w) z)^2} \right).$$

It is not difficult to establish asymptotics of $Q_n(u)$ by appealing to the Cauchy
theorem. This is done in the following lemma.


LEMMA 7.6.8. There exists B > 1 such that for all |u| < @ the following 
evaluation holds 


Ww 


+ O(B~”) 
for some 3 > 1. 


Proof. By Cauchy’s formula 


1 d 
nQn(u) = in ¢ C2 wa 


where the integration is along a loop contained in the unit disk that encircles the 
origin. Let w be such that |w| > AK’, where K’ is defined in Lemma 7.6.7. From 
the proof of Lemma 7.6.7 we conclude that D,,(z) and (1 — z+ P(w)z) have 
only one root in |z| < p for some p > 1. Applying Cauchy’s residue theorem we 
obtain 


1 es dz zie z - 
Bin PO PO ee Cac, Ge Pa) = 


Ae aa jw) Ew 
C2 Ay C3, 


= llPGe) ( ; ) —n(1 — Puy) + Tu(o, 4), 


where 


teow Bf, S ane (poeR- a) 


To establish a bound for [,,(p, u) we argue exactly in the same manner as in the 
proof of Theorem 7.6.5. This leads for |w| > K’ to 


S> Iw(p,u) = O((dpu)*p-”) 


|w|=k 
since for all $w$ we also have $S_w(\rho) \le 1/(1 - p\rho)$ and $D_w(z) = O(\rho^k)$ in the circle
$|z| \le \rho$. Set now $\beta = (\delta\rho)^{-1} > 1$. Then, for $|u| \le \beta$ we have


> 2) gel iy ) = O(p-”). 
{w: |w|>K’} 


This proves our bound since the other terms (|w| < A’) contribute only B~” 
for some B > 1 due to the fact that all roots of D,,(z) have magnitudes greater 
than 1. . 


In the next lemma we show that $Q_n(u) \to 0$ as $n \to \infty$.
LEMMA 7.6.9. For all $1 < \beta < \delta^{-1}$, there exists $\varepsilon > 0$ such that $Q_n(u) =
(1 - u)\, O(n^{-\varepsilon})$ uniformly for $|u| \le \beta$.


Proof. The expansion of E,, with respect to P(w), and Lemma 7.6.4 show that 
as n — oo the following holds >, u!”!P(w) AZ” Ew /C3, = O(1). Therefore, by 
Lemma 7.6.8 we have 


» AC oe 
= Dat iP(w (Fe a ae) : + O(1/n) . 
Let now f(x) be a function defined for x real by 
Aer ae 
fw() = —Ga— — 1 - P(w)) “ 


By the same arguments as used in proving (7.6.8) in Theorem 7.6.5, we 
note that >>, ul”'P(w) fw(x) is absolutely convergent for all x and u such that 


|u| < @. The function fu(x) = fw(x) — fw(O)e~* is exponentially decreasing 
when x — +00 and is O(a) when x — 0; therefore its Mellin transform defined 


as = 
= | fu (x)x>— ‘dx 
0 


is well defined for R(s) > —1. In this region we obtain 


pa wi|-1 (log Aw)" -1 — (—log(1- P(w))* -1 
Su(s) = T(s) (4s | ~ A,,C2 ee ; 


where I'(s) is the gamma function. Let g*(s,u) be the Mellin transform of the 
series >, ul”'P(w) fw(x) which exists at least in the strip (—1,0). Formally, 


we have 5 
aS) \ Fi (a). 


We can reverse the Mellin transform g*(s, wu) provided that the following holds. 


LEMMA 7.6.10. The function $g^*(s, u)$ is analytic in $\Re(s) \in (-1, c)$ for some
$c > 0$.
Assuming Lemma 7.6.10 is granted, we have


1 E+ioo 

Qn(u) = 5 _, 9 U)n*ds + O(1/n) + Dd wiP(w) fale 

for some ¢ € (0,c). Notice that the last term of the above contributes O(e—"”), 
and can be safely ignored. Furthermore, a simple majorization under the inte- 


gral gives the evaluation Q,,(u) = O(n~*) which completes the proof. rT 
Proof of Lemma 7.6.10: We establish the absolute convergence of g*(s,u) for all 
s such that R(s) € (—1,c) and |u| < G. Let us define h*(s,u) = ae Note 
that for any fixed s we have the following 


(og Aw)~* = (FP) 0 OPW) 
(—log(1 — P(w)))* = Pw)" O(P(w))) 
Thus 
(log Aw)" 1 _ (Clog( — P(w)))-# 1 
pao 1 — P(w) 


= P(w)* [(1 + aw(1))°(L + O(/w|P(w)) — (1 + O(P(w))] + O(/w|P(w)) . 
By Lemma 7.6.4, Py (S(1) < 1+ 65*) > 1 — O(6*), and hence 


n*(s,u) = Jo (supp-®, q-® }fuls) O14) 


k=0 


that absolutely converges for all values of s such that (s) < c where c satisfies 
sup{p°,q-°} < (53)~+. Since h*(0, u) = 0 by definition, the pole of I'(s) at s = 
0 is canceled in g*(s,u), and therefore h*(s,u) does not show any singularities 
in the strip R(s) € (-1,c). rT 

To complete the proof of our main Theorem 7.6.2, we need an asymptotic 
analysis of D2(w) which is presented next. We recall that D7? represents also 
the typical depth in a trie built from n independently generated strings. 


LEMMA 7.6.11. There exists ¢ > 0 such that 
Dy (u) = (1 —u)n*™ (DP(«(u)) + P(logn, u))) + O(n‘), 
where 


Vv 
u 2D p, = 1 
i=1 


and P(logn,u) is periodic function with small amplitude in the case where the 
vector (log p1, 

...,log py) is collinear with a rational tuple, and converges to zero when n — oo 
otherwise. 
Proof. We begin with the identity

$$D_n^T(u) = \frac{1-u}{u} \sum_{w \in \mathcal{A}^*} u^{|w|}\, P(w) (1 - P(w))^{n-1}.$$

We argue exactly in the same manner as in the proof of Lemma 7.6.8. We find
the Mellin transform T*(s,u) = [5° «*~'dau/(1 — u)D}(u)dzx to be 


= SF ul¥!P(w)(—log(1 — P(w)))*T(s). 


weA* 


Using the fact that for s bounded (— log(1—P(w)))~* = P(w)7*(1+O(sP(w))), 
we conclude 


T*(s,u) =I(s) a + i) ; 


8) 
g(s,u) =O YP 1Pi re | 
i= al Pi 


Let K(u) be the main root of 1 = ye 1p; *. The other roots of 1 = 


ioe, p, *, are countable and we denote them as «,(u) for k # 0 integer. 
For all integers k we have R(#;,(u)) > «(u). Using the inverse Mellin we find 


Ds “| T*(s,u)n “ds. 


2imtu 


where 


—too 


We now consider |u| < 6~' for 6 < 1. Then there exists ¢ such that for R(s) < ¢ 
the function g(s,u) has no singularity. Moving the integration path to the left 
of $\Re(s) = \varepsilon$, and applying the residue theorem we find the following estimate


i. or P(q(u)) | (a) = PKR (Y)) rex (u) nee 
DF (u) =(1 a +(1 ace + O(n-*) (7.6.9) 


with h(u) =—>-, pre) log pj and hy(u) = — 32, p, () log p;. When log p;’s 
are collinear with a rational vector, then there is subset of K,(u) that have the 
same real part as K(u) and also equally spaced on the vertical line R(s) = 
R(K(u)). In this case their contribution to (7.6.9) is 


nh(u) ee Ys ((KE(w) — K(u))ilogn). 


When the log p,’s are not collinear with a rational vector the contribution of the 
kx(u) divided by n*™) converges to zero when n — oo. r 


The last lemma completes the proof of Theorem 7.6.2. Indeed, it suffices to
observe that for $t \to 0$

$$\kappa(e^t) = c_1 t + \frac{c_2}{2} t^2 + O(t^3), \tag{7.6.10}$$

where $c_1 = 1/h$ and $c_2 = (h_2 - h^2)/h^3$. We concentrate first on the asymmetric
case. From the expression of $D_n^T(u)$ we find immediately the first and the second
moments via the first and the second derivatives of $D_n^T(u)$ at $u = 1$, with the
appropriate asymptotic expansion in $c_1 \log n$ and in $c_2 \log n$. In order to obtain
the limiting normal distribution we prove

$$e^{-t c_1 \log n / \sqrt{c_2 \log n}}\; n^{\kappa\left( e^{t / \sqrt{c_2 \log n}} \right)} \to e^{t^2/2}$$

using $n^{\kappa(u)} = \exp(\kappa(u) \log n)$ and referring to expansion (7.6.10).

For the symmetric case there is no normal limiting distribution since the variance
is $O(1)$. However, there are oscillations due to the fact that all $\kappa_k(u)$ are aligned
on a vertical line. This completes the proof of Theorem 7.6.2.


Problems 
Section 7.2 


7.2.1 Prove (7.2.9). 

7.2.2. In Theorem 7.2.8 we prove that for irreducible aperiodic Markov chain 
the variance Var(NV,,) = nc + c2 (cf. (7.2.26)). Prove that c; > 0. 

7.2.3 Prove that (N;, — E(N,,))/./ Var(N,,) converges in moments to the ap- 
propriate moments of the standard normal distribution. 

7.2.4 Let p(t) be a root of 1 — e'My(e?) = 0. Observe that p(0) = 0. Prove 
that p(t) > 0 for t £0 for p;; > 0 for all i,7 € A. 

7.2.5 Prove the expression (7.2.44) for 0, of Theorem 7.2.12 (cf. Denise and 
Régnier (2004)). 


Section 7.3 


7.3.1 Extend the analysis of Section 7.3 to multisets W, that is, a word w; 
may occur several times in W. 

7.3.2 Prove language relationships (7.3.2)—(7.3.2). 

7.3.3 Derive explicit formulas for 6, appearing in Theorem 7.3.3(iv). 

7.3.4 Find explicit formulas for the values of the mean E(N,,(W)) and of the 
variance Var(N,,(W)) for the generalized pattern matching discussed in 
Section 7.3.2 for Wo =@ and Wo # 0. 

7.3.5 Derive explicit formulas for a, and 6, in (7.3.27) appearing in Theo- 
rem 7.3.10. 

7.3.6 Enumerate (¢, 4) sequences over a non binary alphabet (i.e., generalize 
the analysis of Section 7.3.3). 


Section 7.4 


7.4.1 Find an explicit formula for the generating function BY I(z) of the col- 
lection BI?!, 
7.4.2 Design a dynamic programming algorithm to compute the correlation
coefficient $\kappa^2(\mathcal{W})$.

7.4.3 Establish the rate of convergence for the Gaussian law from Theorem 7.4.5.

7.4.4 For the fully unconstrained subsequence problem establish the large
deviations (cf. Janson (to appear)).

7.4.5 Provide details of the proof for Theorem 7.4.6.

7.4.6 Let $\mathcal{W} = \{w_1, \ldots, w_d\}$ be a set of patterns $w_i$. The pattern $\mathcal{W}$ occurs as
a subsequence in the text if any of the $w_i$ occurs as a subsequence. Analyze
this generalization of the subsequence pattern matching.

7.4.7 Let $w$ be a pattern. Set $W$ to be a window size with $|w| \le W \le n$.
Consider the windowed subsequence pattern matching in which $w$ must
appear as a subsequence within the window $W$. Analyze the number of
windows that have at least one occurrence of $w$ as a subsequence within
the window (cf. Gwadera, Atallah, and Szpankowski 2003).


Section 7.5


7.5.1 Prove the generating operator identities (7.5.5)–(7.5.8).

7.5.2 Prove (7.5.11)–(7.5.13).

7.5.3 Prove the second part of Theorem 7.5.1, that is, derive formula (7.5.18)
for the variance of $\Omega_n(\mathcal{W})$.

7.5.4 Does the central limit theorem hold for the generalized subsequence
problem discussed in Section 7.5? What about large deviations?


Section 7.6


7.6.1 Extend Theorem 7.6.2 to Markov sources.

7.6.2 Prove (7.6.7) and extend it to Markov sources (cf. Apostolico and Szpankowski 1992).

7.6.3 Let $\kappa(u)$ be the main root of $1 = u \sum_{i=1}^{V} p_i^{1-s}$, and let $\kappa_k(u)$, for $k \ne 0$
integer, be the other roots of $1 = u \sum_{i=1}^{V} p_i^{1-s}$. Prove that for all integers $k$
we have $\Re(\kappa_k(u)) \ge \Re(\kappa(u))$.


Notes


Algorithmic aspects of pattern matching are presented in numerous books. We
mention here Crochemore and Rytter (1994) and Gusfield (1997) (cf. also
Apostolico (1985)). Public domain utilities like agrep, grappe, webglimpse for
finding general patterns were recently developed by Wu and Manber (1995),
Kucherov and Rusinowitch (1997), and others. Various data compression
schemes are studied in Wyner and Ziv (1989), Wyner (1997), Yang and Kieffer
(1998), Ziv and Lempel (1978), and Ziv and Merhav (1993). Prediction based on
pattern matching is discussed in Jacquet, Szpankowski, and Apostolico (2002).
Algorithmic aspects of pattern matching can also be found in Chapter 2 and
Chapter 8 of this book.

In this chapter the emphasis is on analysis of pattern matching problems 
by analytic methods in a probabilistic framework. Probabilistic models are dis- 
cussed in Section 7.1 and Chapter 1. Markov models are presented in many 
standard books (cf. Karlin and Ost (1987)). Dynamic sources were intro- 
duced by Vallée (2001) (cf. also Clement, Flajolet, and Vallée (2001), Bourdon 
and Vallée (2002)). General stationary ergodic sources are discussed in Shields 
(1969). 

In this chapter analytic tools are used to investigate combinatorial pattern 
matching problems. The reader is referred to Alon and Spencer (1992), Sz- 
pankowski (2001), Waterman (1995) (cf. also Arratia and Waterman (1989, 
1994)) for in-depth discussion of probabilistic tools. Analytic techniques are 
thoroughly explained in Sedgewick and Flajolet (1995) and Szpankowski (2001). 
The reader may also consult Atallah, Jacquet, and Szpankowski (1993), Bender 
(1973), Clement et al. (2001), Hwang (1996), Jacquet and Szpankowski (1994, 
1998). The Perron-Frobenius theory and the spectral decomposition of
matrices can be found in Gantmacher (1959), Karlin and Taylor (1975), Kato
(1980), and Szpankowski (2001). Operator theory is discussed in Kato (1980).

Exact string matching is presented in Section 7.2. There are numerous
references. Our approach is founded on the work of Guibas and Odlyzko (1981a)
and Guibas and Odlyzko (1981b). The presentation of this section follows very
closely the recent work of Régnier and Szpankowski (1998a) and Régnier (2000).
A more probabilistic approach is adopted in Chapter 2 and in Prum et al. (1995).
Example 7.2.13 is taken from Denise, Régnier, and Vandenbogaert (2001). 

The generalized string matching problem discussed in Section 7.3 was introduced
in Bender and Kochman (1993). The analysis of string matching over a reduced
set of patterns appears in Régnier and Szpankowski (1998b) (cf. also Guibas
and Odlyzko (1981b)). An automaton approach to motif finding was proposed
in Nicodème et al. (2002). The general string matching problem was first dealt
with in Bender and Kochman (1993); however, our presentation follows a
different path, simplifying previous analyses. It is closely related to the subsequence pattern
matching analysis presented in Flajolet, Guivarc’h, Szpankowski, and Vallée 
(2001). The (@,&) sequence analysis is taken from Szpankowski (2001). 

The subsequence pattern matching or the hidden pattern matching discussed
in Section 7.4 is based on Flajolet et al. (2001). Proceeding along different
tracks, Janson (to appear) has related this particular case to his treatment of
U-statistics via Gaussian Hilbert spaces; see Chapter XI of Janson's book Janson
(1997) for the type of method employed. Example 7.4.7 was fully developed in
Gwadera et al. (2003).

The generalized subsequence pattern matching discussed in Section 7.5 is 
taken from Bourdon and Vallée (2002). The operator generating function ap- 
proach for dynamic sources was developed by Vallée (2001). 

In Section 7.6 we present some results for the self-repetitive pattern
matching. Theorem 7.6.2 was proved in Jacquet and Szpankowski (1994); however,
our proof in this section is somewhat simplified. In particular, the proof of the
crucial Lemma 7.6.1 is new and based on results presented in Section 7.2.
Lemma 7.6.11 is due to Jacquet and Régnier (1986) (for an extension to Markov
sources see Jacquet and Szpankowski (1991)). The Mellin transform is explained
in depth in Flajolet, Gourdon, and Dumas (1995) and Szpankowski (2001). Tries
are treated in depth in Mahmoud (1992) and Szpankowski (2001). As mentioned,
the quantity D_n analyzed in the section is also the typical depth in a suffix trie
introduced in Chapter 2 (cf. also Apostolico (1985)). Probabilistic analysis of
suffix tries can be found in Apostolico and Szpankowski (1992), Devroye,
Szpankowski, and Rais (1992), Szpankowski (1993a, 1993b). As discussed in the
section, suffix tries often appear in the analysis of data compression schemes
(cf. Wyner and Ziv (1989), Wyner (1997), Yang and Kieffer (1998), Ziv and
Lempel (1978), Ziv and Merhav (1993)).
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8.0. Introduction 


Repetitions (periodicities) in words are important objects that play a funda- 
mental role in combinatorial properties of words and their applications to string 
processing, such as compression or biological sequence analysis. Using proper- 
ties of repetitions allows one to speed up pattern matching algorithms. 

The problem of efficiently identifying repetitions in a given word is one of
the classical pattern matching problems. Recently, searching for repetitions in
strings has received new motivation from biosequence analysis. In DNA
sequences, successively repeated fragments often bear important biological
information and their presence is characteristic for many genomic structures


Version June 23, 2004 


400 Periodic Structures in Words 


(such as telomere regions, for example). From a practical viewpoint, satellites
and alu-repeats are involved in chromosome analysis and genotyping, and thus
are of major interest to genomic researchers. Accordingly, various biological
studies based on the analysis of tandem repeats have been carried out, and even
databases of tandem repeats in certain species have been compiled.

In this chapter, we present a general efficient approach to computing different 
periodic structures in words. It is based on two main algorithmic techniques — 
a special factorization of the word and so-called longest extension functions — 
described in Section 8.3. Different applications of this method are described in 
Sections 8.4, 8.5, 8.6, 8.7, and 8.8. These sections are preceded by Section 8.2,
devoted to combinatorial enumerative properties of repetitions. Bounding the
maximal number of repetitions is necessary for proving complexity bounds of
the corresponding search algorithms.


8.1. Definitions and preliminary results 


Consider a word w = a_1 ··· a_n. A position in w is any integer ℓ with 1 ≤ ℓ ≤ n.
Any word x such that there exist i ≤ j with x = a_i ··· a_j is called a factor of w.
If a specific pair of integers (i, j) with this property is meant, we speak about an
occurrence of the factor x. We say that this occurrence starts at position i and
ends at position j in w and denote it w[i..j]. We say that a factor occurrence
v = w[i..j] contains a position ℓ of w if i ≤ ℓ ≤ j.

Recall (see Chapter 1) that an integer p is called a period of w if a_i = a_{i+p}
for all i such that 1 ≤ i, i+p ≤ n. Equivalently, p is a period of w iff w[1..n-p] =
w[p+1..n]. If p is a period of w, any factor u of w of length p is called a root
of w. In other words, u is a root of w iff w is a factor of u^n for some natural n.

Each word w has a minimal period that we will denote p(w). The roots u
of w such that |u| = p(w) are called cyclic roots. We also call the cyclic root
w[1..p(w)] the prefix cyclic root of w, and the cyclic root w[n-p(w)+1..n] the
suffix cyclic root of w.

The rational number e(w) = |w|/p(w) is called the exponent of w. If e(w) ≥
2, then w is called a repetition (periodicity). If k = e(w) is an integer greater
than 1, w can be written as u^k = uu···u (k times) and is called an integer
power (k-power, or tandem array in biological literature). A word which is
not an integer power is called primitive. An integer power of even exponent is
commonly called a square (or a tandem repeat). These are words of the form uu
for some word u. An integer power of exponent 2 is called a primitively-rooted
square, as it corresponds to a square uu where u is primitive. In general, any
word w of minimal period p and exponent e can be written as u^k v, where u is
a primitive word, |u| = p, v is a proper prefix of u, and e = k + |v|/|u|.
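The definitions above translate directly into code. The following Python sketch is an illustration only, not part of the original text; the function names are ours, and the quadratic-time search for the minimal period is chosen for clarity rather than efficiency. It assumes a non-empty word given as a string.

```python
from fractions import Fraction


def minimal_period(w):
    """Smallest p >= 1 such that w[i] == w[i + p] whenever both positions exist."""
    n = len(w)
    # p = n always qualifies (both slices are empty), so next() succeeds for n >= 1.
    return next(p for p in range(1, n + 1) if w[: n - p] == w[p:])


def exponent(w):
    """e(w) = |w| / p(w), kept as an exact rational number."""
    return Fraction(len(w), minimal_period(w))


def is_repetition(w):
    return exponent(w) >= 2


def is_integer_power(w):
    e = exponent(w)
    return e.denominator == 1 and e > 1


def is_primitive(w):
    return not is_integer_power(w)


def decompose(w):
    """Write w = u^k v with |u| = p(w) and v a proper prefix of u; return (u, k, v)."""
    p = minimal_period(w)
    k = len(w) // p
    return w[:p], k, w[k * p:]
```

For instance, `decompose("10101")` yields the prefix cyclic root `"10"`, `k = 2` and `v = "1"`, in agreement with e = k + |v|/|u| = 5/2.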


The following proposition specifies some properties of repetitions. 
PROPOSITION 8.1.1. A word r of length m is a repetition of minimal period
p ≤ m/2 if and only if one of the following conditions holds:
(i) r[1..m-p] = r[p+1..m], and p is the minimal number with this property,
(ii) any factor of r of length 2p is a square, and p is the minimal number with
this property.


From now on, we will be interested in repetitions occurring as factors of
some word, that is, in factor occurrences r = w[i..j] with e(r) ≥ 2. A maximal
repetition in a word w is a repetition r = w[i..j] such that

(i) if i > 1, then p(w[i-1..j]) > p(w[i..j]), and

(ii) if j < n, then p(w[i..j+1]) > p(w[i..j]).
In other words, a maximal repetition is a repetition r = w[i..j] such that no
factor of w that contains r as a proper factor has the same minimal period
as r. For example, the factor 10101 in the word w = 1011010110110 is a 
maximal repetition (with period 2), while the factor 1010 is not. Other maximal 
repetitions of w are prefix 10110101101 (period 5), suffix 10110110 (period 3), 
prefix 101101 (period 3), and the three occurrences of 11 (period 1). (Note that 
different repetitions can be equal words.) 

Any repetition in a word can be extended to a unique maximal repetition, 
that we will call the corresponding maximal repetition. For example, the repe- 
tition 1010 in word w = 1011010110110 corresponds to the maximal repetition 
10101 obtained by the extension by one letter to the right. 
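To make the definition concrete, the example can be checked by exhaustive search. The sketch below is ours (hypothetical helper names, brute force over all factor occurrences, so only suitable for short words); it applies conditions (i) and (ii) literally and reports each maximal repetition as a triple (i, j, p) with 1-based positions.

```python
def minimal_period(w):
    """Smallest p >= 1 such that w[: n - p] == w[p:]."""
    n = len(w)
    return next(p for p in range(1, n + 1) if w[: n - p] == w[p:])


def maximal_repetitions(w):
    """All maximal repetitions of w as (i, j, p): r = w[i..j], 1-based, p = p(r)."""
    n = len(w)
    found = []
    for i in range(n):                      # 0-based start
        for j in range(i + 1, n):           # 0-based end, inclusive
            r = w[i: j + 1]
            p = minimal_period(r)
            if 2 * p > len(r):
                continue                    # exponent below 2: not a repetition
            # Condition (i): extending by one letter to the left increases the period.
            left_ok = i == 0 or minimal_period(w[i - 1: j + 1]) > p
            # Condition (ii): the same for one letter to the right.
            right_ok = j == n - 1 or minimal_period(w[i: j + 2]) > p
            if left_ok and right_ok:
                found.append((i + 1, j + 1, p))
    return found
```

On w = 1011010110110 this returns seven triples: (4, 8, 2) for 10101, (1, 11, 5) and (1, 6, 3) for the two prefixes, (6, 13, 3) for the suffix, and (3, 4, 1), (8, 9, 1), (11, 12, 1) for the three occurrences of 11.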

A basic result about periods is the Fine and Wilf theorem:


THEOREM 8.1.2 (Fine and Wilf). If w has periods p_1, p_2, and |w| ≥ p_1 + p_2 -
gcd(p_1, p_2), then gcd(p_1, p_2) is also a period of w.
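The statement can be sanity-checked on small cases. The following Python sketch (our own illustration; the function names and the length limit are arbitrary choices) verifies the theorem for all binary words up to a given length. The word abaaba, of length 6 with periods 3 and 5 but not 1, shows that the length bound cannot be lowered by one.

```python
from itertools import product
from math import gcd


def has_period(w, p):
    """True if w[i] == w[i + p] for every i with both positions inside w."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def fine_wilf_holds(max_len=10, alphabet="01"):
    """Exhaustively check Theorem 8.1.2 on all words up to max_len letters."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            periods = [p for p in range(1, n + 1) if has_period(w, p)]
            for p1 in periods:
                for p2 in periods:
                    g = gcd(p1, p2)
                    if n >= p1 + p2 - g and not has_period(w, g):
                        return False
    return True
```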


Two factor occurrences w[i..j] and w[k..ℓ] are said to overlap if their intervals
[i..j] and [k..ℓ] have a non-empty intersection. The overlap of the two factors
is then the factor w[r..s], where [r..s] is the intersection of [i..j] and [k..ℓ]. The
following lemma will be used in the sequel.


LEMMA 8.1.3. Two distinct maximal repetitions with the same period p can- 
not have an overlap of length greater than or equal to p. 


Proof. From a case analysis of relative positions of two repetitions of period p,
it follows that if they intersect on at least p letters, at least one of them is not
maximal. □


A repetition r occurring in a word w is said to have a root in some factor
occurrence of w if r overlaps with this occurrence by at least p(r) letters. Also,
we say that a repetition r has a root on the right (respectively on the left) of a
position ℓ of w, meaning that r overlaps by at least p(r) letters with
the suffix w[ℓ+1..n] (respectively, the prefix w[1..ℓ-1]).

The following reformulation of Lemma 8.1.3 will also be useful.


COROLLARY 8.1.4. If two maximal repetitions of w contain a position ℓ and
have a root on the right (on the left) of ℓ, then either these maximal repetitions
have different minimal periods or they coincide.

We will also use the following proposition, which is again a consequence of
the Fine and Wilf theorem.


PROPOSITION 8.1.5. If u is a primitive word, then u cannot be an internal 
factor of uu (that is, a factor which is not a prefix or suffix). 


Proof. If u is an internal factor of uu, then u = xy = yx, where x, y are
nonempty words. We show that x and y are powers of the same word, by
induction on the length of xy. The claim holds if |x| = |y|. Otherwise, assume
|x| > |y|. Then x = yz for some nonempty word z. It follows that zy = yz. By
induction, y and z are powers of the same word, and so are x and y. Thus, u is
not primitive. □
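Proposition 8.1.5, together with the easy converse (if u = v^k with k ≥ 2, then u occurs in uu at position |v| + 1), gives a compact primitivity test: u is primitive exactly when its first occurrence in uu after position 1 is the trailing one. A minimal Python sketch of this idea follows; the function name is ours, and the cost depends on the string search used by the runtime.

```python
def is_primitive(u):
    """u (non-empty) is primitive iff u has no internal occurrence in uu."""
    # Search from index 1 so that the trivial occurrence at index 0 is skipped;
    # the occurrence at index len(u) always exists.
    return len(u) > 0 and (u + u).find(u, 1) == len(u)
```

For example, `is_primitive("10110")` holds, while `is_primitive("101101")` does not, since 101101 = (101)^2.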


8.2. Counting maximal repetitions 


Before considering algorithmic issues related to repetitions, we briefly study in 
this section the combinatorial question of the number of repetitions occurring 
in a word. The main point here is to show that considering maximal repetitions 
leads to a compact linear-space representation of all repetitions. This result 
will motivate building an efficient algorithm for computing all maximal repeti- 
tions, and, on the other hand, will be used to obtain bounds on its algorithmic 
complexity. 


8.2.1. Counting squares: an upper bound 


We start with the question of how many square occurrences a word can contain.
Clearly, if all squares are counted, a word can contain a quadratic number of
those (e.g. 1^n). If we restrict our attention to primitively-rooted squares, the
following result holds.


THEOREM 8.2.1. The number of occurrences of primitively-rooted squares in
a word of length n is O(n log n).


The proof is based on the following “lemma of three squares”: 


LEMMA 8.2.2. Consider three squares x^2, y^2, z^2 and assume that z^2 is a prefix
of y^2 and y^2 is a prefix of x^2. Assume that z is a primitive word. Then
|z| < |x|/2.


Proof. Denote w = x^2. Assume that |z| ≥ |x|/2. Let p = |y| - |z| and
k = |z| - p = 2|z| - |y|. Observe that the prefix w[1..k] of z is also a suffix of z,
and therefore w[1..k] = w[p+1..k+p] = w[p+1..|z|]. We show by induction
that for every i, k+1 ≤ i ≤ k+p, we have w[i] = w[i+p]. We have w[i] =
w[i+|y|] = w[i+|y|-|x|], as i+|y| > 2|z| ≥ |x|. By induction, the latter is
equal to w[i+|y|-|x|+p] = w[i+|y|+p] = w[i+p]. We showed that z (and
even z^2) occurs at position p+1 in w. By Proposition 8.1.5, this contradicts
that z is primitive. □


Proof of Theorem 8.2.1. Lemma 8.2.2 implies that the number of primitively-
rooted squares occurring as prefixes of a word w is less than 2 log_2 |w|. Indeed,
from Lemma 8.2.2, it follows that if w has as its prefixes j primitively-rooted
squares, then |w| ≥ 2^{j/2}. Theorem 8.2.1 follows. □


8.2.2. Repetitions in Fibonacci words 


Fibonacci words are words over the binary alphabet {0,1} defined recursively
by f_0 = 0, f_1 = 1, f_n = f_{n-1} f_{n-2} for n ≥ 2. The length of f_n, denoted
F_n, is the n-th Fibonacci number. Fibonacci words have numerous interesting
combinatorial properties and often provide a good example to test conjectures
and analyze algorithms on words.

The following lemma summarizes some properties of Fibonacci words that 
we will need in this section. 


LEMMA 8.2.3. (i) For n ≥ 2, f_n and the word f_{n-2} f_{n-1} share a common
prefix of length F_n - 2.
(ii) For every n, f_n is a primitive word.
(iii) Every repetition occurring in Fibonacci words has the minimal period F_k
for some k, and has a cyclic root f_k.
(iv) Fibonacci words contain no repetition of exponent 4.


Proof. (i) is easily proved by induction. To prove (ii), observe that for n ≥ 2,
f_n contains F_{n-1} occurrences of 1, as can be easily proved by induction. If
f_n is not primitive, that is, f_n = w^k for a non-empty word w and k ≥ 2, then
both F_n = |f_n| and F_{n-1} (the number of 1's in f_n) are divisible by k, which is
a contradiction, as F_n and F_{n-1} are mutually prime, which can again be easily
proved by induction. Proving (iii) and (iv) is beyond the scope of this chapter
(see Notes). □
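Parts (i) and (ii) of the lemma, and the counting facts used in the proof, are easy to confirm experimentally for small n. The Python sketch below is only an illustration under our own naming; it builds the words f_0, f_1, ... with the indexing of this section (f_0 = 0, f_1 = 1) and checks each claim directly.

```python
from math import gcd


def fibonacci_words(count):
    """Return [f_0, ..., f_{count-1}] with f_0 = '0', f_1 = '1', f_n = f_{n-1} f_{n-2}."""
    words = ["0", "1"]
    while len(words) < count:
        words.append(words[-1] + words[-2])
    return words[:count]


def common_prefix_length(a, b):
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return length


def is_primitive(u):
    # Primitivity test based on Proposition 8.1.5 and its converse.
    return (u + u).find(u, 1) == len(u)


def check_lemma(count=15):
    """Check Lemma 8.2.3 (i), (ii) and the facts used in its proof for n < count."""
    f = fibonacci_words(count)
    F = [len(word) for word in f]
    for n in range(2, count):
        if common_prefix_length(f[n], f[n - 2] + f[n - 1]) != F[n] - 2:
            return False                       # part (i)
        if f[n].count("1") != F[n - 1] or gcd(F[n], F[n - 1]) != 1:
            return False                       # facts used in the proof of (ii)
        if not is_primitive(f[n]):
            return False                       # part (ii)
    return True
```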


We now prove that Fibonacci words realize the asymptotically largest num-
ber of square occurrences, according to Theorem 8.2.1. This shows, in particular,
that the O(n log n) bound of Theorem 8.2.1 is asymptotically tight.


THEOREM 8.2.4. Fibonacci word f_n contains Θ(F_n log F_n) occurrences of
primitively-rooted squares.


Proof. Let S_n be the number of occurrences of primitively-rooted squares in f_n.
By induction on n, we will show that S_n ≥ (1/6) F_n log_2 F_n for n ≥ 5. For n = 5
and n = 6, the inequality is verified directly. Assume now that n ≥ 7. Consider
the decomposition f_n = f_{n-1} f_{n-2} and call the position between f_{n-1} and f_{n-2}
the frontier. Clearly, all squares in f_n are divided into those which lie entirely
in f_{n-1} or f_{n-2} and those which cross the frontier, i.e. overlap both with f_{n-1}
and with f_{n-2}. Therefore, applying this decomposition to f_{n+1} = f_n f_{n-1},

\[ S_{n+1} = S_n + S_{n-1} + S'_{n+1}, \]

where S'_{n+1} is the number of primitively-rooted squares crossing the frontier.
By Lemma 8.2.3(i), the prefix of f_{n+1} of length F_{n+1} - 2 is also a prefix of
the word f_{n-1} f_n = f_{n-1} f_{n-1} f_{n-2}. Since f_{n-2} is a prefix of f_{n-1}, f_{n+1}
contains F_{n-2} - 1 squares of period F_{n-1} that cross the frontier. Since f_{n-1}
is a primitive word (Lemma 8.2.3(ii)), all those squares are primitively-rooted.
Since

\[ f_{n+1} = f_n f_{n-1} = f_{n-1} f_{n-2} f_{n-2} f_{n-3} \]

and f_{n-3} is a prefix of f_{n-2}, the suffix f_{n-2} f_{n-2} f_{n-3} of f_{n+1} contains
F_{n-3} + 1 squares of period F_{n-2} that cross the frontier. Since f_{n-2} is a primitive
word, all those squares are primitively-rooted too.

We obtain that S'_{n+1} ≥ (F_{n-2} - 1) + (F_{n-3} + 1) = F_{n-1}. Therefore,

\[
\begin{aligned}
S_{n+1} &\ge S_n + S_{n-1} + F_{n-1}
  \ge \tfrac{1}{6} F_n \log_2 F_n + \tfrac{1}{6} F_{n-1} \log_2 F_{n-1} + F_{n-1} \\
 &= \frac{F_{n+1}}{6} \Bigl[ \frac{F_n}{F_{n+1}} \Bigl( \log_2 \frac{F_n}{F_{n+1}} + \log_2 F_{n+1} \Bigr)
    + \frac{F_{n-1}}{F_{n+1}} \Bigl( \log_2 \frac{F_{n-1}}{F_{n+1}} + \log_2 F_{n+1} \Bigr)
    + \frac{6 F_{n-1}}{F_{n+1}} \Bigr] \\
 &= \frac{F_{n+1}}{6} \, ( \log_2 F_{n+1} + A ),
\end{aligned}
\]

where

\[
A = \frac{F_n}{F_{n+1}} \log_2 \frac{F_n}{F_{n+1}}
  + \frac{F_{n-1}}{F_{n+1}} \log_2 \frac{F_{n-1}}{F_{n+1}}
  + \frac{6 F_{n-1}}{F_{n+1}} .
\]

Both the first and the second term of the expression A are greater than or equal
to min_{0<x<1} (x log_2 x) = -(log_2 e)/e. The third term is at least 3/2, since
F_{i+1} ≤ 2F_i for every i. We derive that

\[ A \ge \frac{3}{2} - \frac{2 \log_2 e}{e} > 0 . \]

We conclude that S_{n+1} ≥ (1/6) F_{n+1} log_2 F_{n+1}. The upper bound follows from
Theorem 8.2.1. □
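The lower bound used in the induction can be observed numerically. The following Python sketch (ours, brute force, intended for small n only) counts the occurrences of primitively-rooted squares in f_n, which gives S_5 = 4 and S_6 = 11, and can be compared with (1/6) F_n log_2 F_n.

```python
import math


def fibonacci_word(n):
    """f_n with f_0 = '0', f_1 = '1', f_n = f_{n-1} f_{n-2}."""
    a, b = "0", "1"                 # (f_0, f_1)
    for _ in range(n):
        a, b = b, b + a             # (f_k, f_{k+1}) -> (f_{k+1}, f_{k+2})
    return a


def is_primitive(u):
    return (u + u).find(u, 1) == len(u)


def count_primitively_rooted_squares(w):
    """Number of occurrences w[i..i+2p-1] = uu with u primitive."""
    n = len(w)
    total = 0
    for p in range(1, n // 2 + 1):
        for i in range(n - 2 * p + 1):
            root = w[i: i + p]
            if root == w[i + p: i + 2 * p] and is_primitive(root):
                total += 1
    return total


def lower_bound(n):
    """The quantity (1/6) F_n log_2 F_n from the proof of Theorem 8.2.4."""
    length = len(fibonacci_word(n))
    return length * math.log2(length) / 6
```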

In view of Theorem 8.2.1, one might want to estimate the maximal number
of occurrences of primitively-rooted integer powers in a word, and conjecture
that this number is asymptotically smaller than the number of square occur-
rences, since each primitively-rooted integer power of exponent e represents
e - 1 primitively-rooted squares. In particular, one might want to count the
occurrences of non-extensible primitively-rooted integer powers, that is, those
primitively-rooted integer powers u^k, k ≥ 2, which are not followed or preceded
by another occurrence of u. Fibonacci words provide a counter-example to this
hypothesis, as by Lemma 8.2.3, all integer powers contained in them are of ex-
ponent two or three, and therefore the maximal number of their occurrences is
still Θ(F_n log F_n), as implied by Theorem 8.2.4.

What happens if we count maximal repetitions instead of occurrences of
integer powers or just squares? Note that a word can contain far fewer maximal
repetitions than integer powers: e.g. if v is a square-free word over a three-letter
alphabet, then the word v$v$v contains |v| + 1 non-extensible integer powers (here
squares) but only one maximal repetition. What is the number of maximal
repetitions in Fibonacci words? The following theorem gives the exact answer.


THEOREM 8.2.5. Let R_n be the number of maximal repetitions in f_n. Then
for all n ≥ 4, R_n = 2F_{n-2} - 3.


Proof. As in the proof of Theorem 8.2.4, we divide all maximal repetitions
in f_n into those which lie entirely in f_{n-1} or f_{n-2} and those which cross the
frontier, i.e. overlap with f_{n-1} and with f_{n-2}. We call such repetitions crossing
repetitions of f_n. Overlaps of a crossing repetition with f_{n-1} and f_{n-2} are
called the left part and the right part respectively. Note first that the left part
and the right part of a maximal repetition cannot both be of exponent greater
than or equal to 2, since Fibonacci words do not have factors of exponent 4
(Lemma 8.2.3(iv)). If either the left or the right part is of exponent at least 2,
then this repetition is an extension of a maximal repetition of f_{n-1} or f_{n-2},
respectively. This implies that the only new maximal repetitions of f_n that should
be counted are crossing maximal repetitions with both right and left part of
exponent smaller than 2. We call those repetitions composed repetitions of f_n.
Let c(n) be their number. The following lemma allows us to compute c(n).


LEMMA 8.2.6. Let R_n be the number of occurrences of maximal repetitions in
the Fibonacci word f_n, and set R_n = R_{n-1} + R_{n-2} + c(n). Then for all n ≥ 8,
c(n) = c(n - 2).


Proof. Consider the representation

\[
f_n = f_{n-1} \,|\, f_{n-2} = f_{n-2} f_{n-3} \,|\, f_{n-3} f_{n-4}
    = f_{n-2} [ f_{n-3} \,|\, f_{n-4} ] f_{n-5} f_{n-4} \qquad (8.2.1)
\]

where | denotes the frontier, n ≥ 5, and square brackets delimit the occurrence
of f_{n-2} with the same frontier as for the whole word f_n (we call it the central
occurrence of f_{n-2}). By Lemma 8.2.3(iii), the minimal period of every repetition
in Fibonacci words is equal to F_k for some k. Since F_{n-3} > F_{n-4} ≥ 2F_{n-6}, it
follows from (8.2.1) that if a composed repetition of f_n has the minimal period
F_k for k ≤ n - 6, then it is also a composed repetition of f_{n-2} and therefore
is counted in c(n - 2). Vice versa, every composed repetition of f_{n-2} with the
period F_k for k ≤ n - 6 is also a composed repetition of f_n. We now examine
crossing maximal repetitions of f_n with minimal periods F_{n-2}, F_{n-3}, F_{n-4},
F_{n-5}.

Crossing repetitions with the minimal period F_{n-2}. The last term of (8.2.1)
shows that the square (f_{n-2})^2 is a prefix of f_n that crosses the frontier. As F_{n-1} <
2F_{n-2}, the corresponding maximal repetition does not have a square in its left
or right part and therefore is composed for f_n. Since F_{n-2} > F_n/3, any two
maximal repetitions of f_n with the period F_{n-2} overlap by more than F_{n-2}
letters. By Lemma 8.1.3, f_n has only one maximal repetition with the period
F_{n-2}. Trivially, the repetition under consideration is not a repetition for the
central occurrence of f_{n-2}.

Crossing repetitions with the minimal period F_{n-3}. From the decomposition
f_n = f_{n-2} f_{n-3} | f_{n-3} f_{n-4} (see (8.2.1)), there is a square (f_{n-3})^2 of period F_{n-3}
crossing the frontier. The corresponding maximal repetition does not extend
to the left of the left occurrence of f_{n-3}, as the last letters of f_{n-3} and f_{n-2}
are different (the last letters of the f_i's alternate). Therefore, this repetition does
not have a square in its left or right part, and thus is composed for f_n. As
this maximal repetition has a root both on the left and on the right of the
frontier, it is the only repetition with the period F_{n-3} crossing the frontier
(see Corollary 8.1.4). Again, from length considerations, it is not a maximal
repetition for the central occurrence of f_{n-2}.

Crossing repetitions with the minimal period F_{n-4}. Since

\[
\begin{aligned}
f_n &= f_{n-1} \,|\, f_{n-4} f_{n-5} f_{n-4} = f_{n-1} \,|\, f_{n-4} f_{n-5} f_{n-5} f_{n-6} \\
    &= f_{n-1} \,|\, f_{n-4} f_{n-5} f_{n-6} f_{n-7} f_{n-6} = f_{n-1} \,|\, (f_{n-4})^2 f_{n-7} f_{n-6},
\end{aligned}
\]

there is a square (f_{n-4})^2 on the right of the frontier. However, this maximal
repetition does not extend to the left (i.e. does not cross the frontier), since the
last letters of f_{n-4} and f_{n-1} are different.

On the other hand,

\[
\begin{aligned}
f_n &= f_{n-2} [ f_{n-3} \,|\, f_{n-4} ] f_{n-5} f_{n-4}
     = f_{n-3} f_{n-4} [ f_{n-4} f_{n-5} \,|\, f_{n-5} f_{n-6} ] f_{n-5} f_{n-4} \\
    &= f_{n-3} f_{n-4} [ f_{n-4} f_{n-5} \,|\, f_{n-6} f_{n-7} f_{n-6} ] f_{n-5} f_{n-4},
       \quad \text{for } n \ge 7 .
\end{aligned}
\]

This reveals a maximal repetition with the period F_{n-4} which crosses the fron-
tier. However, this is not a composed repetition of f_n, as it has a square on the
left of the frontier. On the other hand, the restriction of this repetition to the
central occurrence of f_{n-2} is a composed repetition for f_{n-2}.

There is no other repetition with the period F_{n-4} crossing the frontier, since
such a repetition would overlap with one of the two above by more than one
period, which would contradict Lemma 8.1.3. In conclusion, there is one composed
repetition with the period F_{n-4} in the central occurrence of f_{n-2} and no such
repetition in f_n.


Crossing repetitions with the minimal period F_{n-5}. Rewrite

\[ f_n = f_{n-2} [ f_{n-4} f_{n-5} \,|\, f_{n-5} f_{n-6} ] f_{n-5} f_{n-4}, \]

which shows that there is a square of period F_{n-5} crossing the frontier. Since the
frontier is the center of this square, the latter corresponds to the only crossing
repetition with the period F_{n-5}. However, this repetition is not a composed
repetition for f_n, as it has a square in its right part, as shown by the following
transformation:

\[
\begin{aligned}
f_n &= f_{n-2} [ f_{n-4} f_{n-5} \,|\, f_{n-5} f_{n-6} ] f_{n-6} f_{n-7} f_{n-4} \\
    &= f_{n-2} [ f_{n-4} f_{n-5} \,|\, f_{n-5} f_{n-6} ] f_{n-7} f_{n-8} f_{n-7} f_{n-4},
       \quad \text{for } n \ge 8 .
\end{aligned}
\]

On the other hand, the restriction of this repetition to the central occurrence of
f_{n-2} is a composed repetition for f_{n-2}. Thus, there is one composed maximal
repetition with the period F_{n-5} in the central occurrence of f_{n-2} and no such
repetition in f_n.

In conclusion, two new composed repetitions arise in f_n in comparison to
f_{n-2}, but two composed maximal repetitions of the central occurrence of f_{n-2}
are no longer composed in f_n, as they extend in f_n to form a square in its right
or left part. This shows that c(n) = c(n - 2) for n ≥ 8. □


Proof of Theorem 8.2.5 (continued): A direct counting shows that R_0 = 0, R_1 =
0, R_2 = 0, R_3 = 0, R_4 = 1, R_5 = 3, R_6 = 7, R_7 = 13. Therefore, c(3) =
0, c(4) = 1, c(5) = 2, c(6) = 3, c(7) = 3. Since c(n) = c(n - 2) for all
n ≥ 8, c(n) = 3 for all n ≥ 6. We then have the recurrence relation
R_n = R_{n-1} + R_{n-2} + 3 for n ≥ 6 with boundary conditions R_4 = 1, R_5 = 3.
Substituting R_n = 2R'_{n-2} - 3, we get the relation R'_n = R'_{n-1} + R'_{n-2} for n ≥ 4,
where R'_2 = 2, R'_3 = 3. This defines exactly the Fibonacci numbers. Thus,
R'_n = F_n for n ≥ 2, and R_n = 2F_{n-2} - 3 for n ≥ 4. □
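The closed formula and the values of c(n) can be cross-checked by direct enumeration for small n. The sketch below is our own illustration and not the linear-time algorithm of Section 8.4: for each candidate period p it scans the maximal runs of positions t with w[t] = w[t+p], keeps those spanning a factor of length at least 2p, and counts the factor only if p is its minimal period.

```python
def fibonacci_word(n):
    """f_n with f_0 = '0', f_1 = '1', f_n = f_{n-1} f_{n-2}."""
    a, b = "0", "1"
    for _ in range(n):
        a, b = b, b + a
    return a


def minimal_period(w):
    n = len(w)
    return next(p for p in range(1, n + 1) if w[: n - p] == w[p:])


def count_maximal_repetitions(w):
    """Number of maximal repetitions of w (simple polynomial-time enumeration)."""
    n = len(w)
    count = 0
    for p in range(1, n // 2 + 1):
        t = 0
        while t + p < n:
            if w[t] != w[t + p]:
                t += 1
                continue
            start = t
            while t + p < n and w[t] == w[t + p]:
                t += 1
            # w[start : t + p] has period p and cannot be extended with period p.
            if t - start >= p and minimal_period(w[start: t + p]) == p:
                count += 1
    return count


def check_theorem(max_n=10):
    """Compare R_n with 2 F_{n-2} - 3 for 4 <= n <= max_n."""
    return all(
        count_maximal_repetitions(fibonacci_word(n)) == 2 * len(fibonacci_word(n - 2)) - 3
        for n in range(4, max_n + 1)
    )
```

The values c(n) = R_n - R_{n-1} - R_{n-2} obtained from this enumeration can likewise be compared with those listed in the proof.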


Note that by Lemma 8.2.3(iv), any maximal repetition in a Fibonacci word 
has the exponent smaller than 4, and therefore not only the number of maximal 
repetitions is linearly bounded, but also the sum of their exponents is linearly 
bounded on the word length. 


8.2.3. Counting maximal repetitions 


The results on Fibonacci words suggest a conjecture that arbitrary words contain
only a linear number of maximal repetitions and, moreover, that the sum of their
exponents is linearly bounded too. In this section we confirm these conjectures.
Later in Section 8.4, this result will allow us to derive a linear-time algorithm
for identifying all maximal repetitions in a word.


THEOREM 8.2.7. Let R(w) be the set of all maximal repetitions in a word w
of length n (over an arbitrary alphabet), and let E(w) = Σ_{r ∈ R(w)} e(r). Then
E(w) = O(n).


The existing proof of Theorem 8.2.7 is very technical and is done by a tedious
case analysis. We do not include it here and refer the reader to Kolpakov and
Kucherov (2000b). Finding a simple proof of Theorem 8.2.7 remains an open
problem.

As a corollary of Theorem 8.2.7, we obtain that the number of maximal 
repetitions in a word over an arbitrary alphabet is linearly-bounded in the length 
of the word. 


THEOREM 8.2.8. Let R(w) be the set of all maximal repetitions in a word w 
of length n. Then Card (R(w)) = O(n). 


Together with the previous results of Section 8.2, this confirms that the set
of maximal repetitions is a more compact representation of all repetitions than
the set of primitively-rooted squares or primitively-rooted non-extensible integer
powers.


Figure 8.1. Case LP_v(i - k + 1) > ℓ


8.3. Basic algorithmic tools 


From now on, we will be interested in the algorithmic question: How to effi- 
ciently compute different types of repetitions in a given word? In the following 
sections, we present algorithms that efficiently compute all maximal
repetitions, as well as other types of repetitions. All those algorithms are based
on a common general technique, and therefore a “secondary goal” of this chap- 
ter is to demonstrate the power of this approach. The technique is based on 
two main tools that we describe in this section. 


8.3.1. Longest extension functions 


The first tool is longest extension functions. Computing those functions (which 
are basically arrays of integer values, indexed by word positions) is an important 
component of the algorithms to be presented. Longest extension functions come 
in different variants — here we present a basic formulation and we will refer to 
it afterwards. 

Consider a word v of length n. For each position i of v, we want to compute
the longest factor of v which starts at position i and is also a prefix of v.
Formally, we want to compute, for all i ∈ [1..n], the value LP_v(i) defined as
the maximal ℓ > 0 such that v[1..ℓ] = v[i..i+ℓ-1], and LP_v(i) = 0 if no such
positive ℓ exists. Note that LP_v(1) = n and, by convention, we always set
LP_v(n+1) = 0. For example, for v = 101101011011, the values of LP_v(i) for
i = 1,...,13 are respectively 12,0,1,3,0,6,0,1,4,0,1,1,0.

We now describe an algorithm that computes DP, in O(n) time. The al- 
gorithm processes v from left to right and computes LP,(i) for all positions 7 
successively. The computation is based on the following idea. Assume we have 
computed LP,(j) for all 7 < i, and we are about to compute LP,(i). Assume 
we have stored a position & < 7 that maximizes k + LP,(k), and assume that 
k+LP,(k) >i. Set €=k+ LP,(k) —i and consider the value LP,(i—k+1). If 
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; LP,(k) 
kot wit LP,(i-k+1)] 


be —t > 


VU 


vf[LP,i-k+1)+1] 


Figure 8.2. Case LP,(i—k+1) <2 


LP,(t—k+1) > ¢ (see Figure 8.1), then we claim that LP,(i) = ¢. Indeed, it is 
easily seen that v[i..1+¢—1] = v[1..4]. On the other hand, v[é+ 1] 4 v[i + 4], as 
vit 4] = o[k+LP,(k)] A v[LP,(k)+1], and v[LP,(k)+1] = v[é+1]. Therefore, 
LP, (t) = ¢. 

If LP,(i—k+1) < £ (see Figure 8.2), then we show that LP, (i) = LP,(i—k+ 
1). Again, v[t..0+ DP,(¢—k+1)—-1) = v[l..LP,i-—k+1)], but ofi+DP,(i-—k+ 
1)] A v[LP,(i—k+1)4+1], since of[i+ LP, (i—k+1)] = vofi—k+14+LP,(i—k+1)] 
and vii -kK+1+0P,i-k+1)] Av LP(i-k+1)+ 1). 

Putting together the two cases above, if LP,(i-—k+1) 4 @, then LP, (i) = 
min{/, DP,(i —k + 1)}. The only case when LP,(i) cannot be computed im- 
mediately is DP,(i-k+1) = @. In this case, we keep on reading letters 
vli+ Q, vit @41],... while v[fi+@+ 3] =v[€+ J], and thus compute the value 
LP,(t). Position 7 is then stored as the new value of k. The resulting linear-time 
algorithm LONGEST-PREFIX-EXTENSION is shown below. 

Function LP_v can be generalized to compute, for all positions of v, the
longest factor that starts at this position and is a prefix of another fixed word.
Formally, for two words v[1..n] and w[1..m], we define LP_{v|w} to be the function
associating with every position i ∈ [1..n] of v the maximal length of a factor of
word vw which starts at position i and is a prefix of w. Note that this factor
starts inside v but can overlap with w. All values of LP_{v|w} can be computed in
time O(m + n).

LONGEST-PREFIX-EXTENSION(v[1..n])
    LP(1) ← n
    j ← 0
    while j ≤ n - 2 and v[j + 1] = v[j + 2] do
        j ← j + 1
    LP(2) ← j
    k ← 2
    for i ← 3 to n do
        ℓ ← k + LP(k) - i
        if LP(i - k + 1) ≠ ℓ and ℓ > 0 then
            LP(i) ← min(ℓ, LP(i - k + 1))
        else j ← max(0, ℓ)
            while i + j ≤ n and v[i + j] = v[j + 1] do
                j ← j + 1
            LP(i) ← j
            k ← i
    return LP
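The pseudocode above can be transcribed almost line by line. The following Python sketch is one possible transcription, not a reference implementation: the function name and the padding of the string to obtain 1-based indices are choices made here for illustration.

```python
def longest_prefix_extension(v):
    """Return a list LP where LP[i] (1-based, i = 1..n) is the length of the
    longest factor of v starting at position i that is also a prefix of v.
    LP[0] is unused padding; LP[n+1] = 0 by convention."""
    n = len(v)
    s = " " + v                 # s[1..n] = v, so indices match the pseudocode
    LP = [0] * (n + 2)
    if n == 0:
        return LP
    LP[1] = n
    if n == 1:
        return LP
    j = 0
    while j <= n - 2 and s[j + 1] == s[j + 2]:
        j += 1
    LP[2] = j
    k = 2                       # position maximizing k + LP[k] among those seen
    for i in range(3, n + 1):
        l = k + LP[k] - i       # the quantity called ℓ in the text
        if l > 0 and LP[i - k + 1] != l:
            # both "easy" cases: the answer is known without reading letters
            LP[i] = min(l, LP[i - k + 1])
        else:
            # extend the match letter by letter, starting from what is known
            j = max(0, l)
            while i + j <= n and s[i + j] == s[j + 1]:
                j += 1
            LP[i] = j
            k = i
    return LP


print(longest_prefix_extension("101101011011")[1:])
```

On the example word of this section the call prints the thirteen values listed earlier, the last one being the conventional LP(n + 1) = 0.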


Symmetrical functions can be defined with respect to suffixes instead of
prefixes. For a word v[1..n], we define LS_v(i) to be the length of the longest
factor of v that ends at position i and is a suffix of v. For v[1..n] and w[1..m],
LS_{w|v}(i), i ∈ [1..n], is the maximal length of a factor of wv that ends at
position m + i in wv and is a suffix of w. Both these functions can be computed
in time linear in the involved words, using an algorithm similar to
LONGEST-PREFIX-EXTENSION. In particular, the function computing LS_{w|v} will
be called LONGEST-SUFFIX-EXTENSION(w, v) in the sequel.
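As an illustration of the generalized functions, the sketch below computes LP_{v|w} and LS_{w|v} in linear time. It does not follow the text's algorithm: it swaps in the standard Z-function (longest common prefix of every suffix with the whole string), applied to the concatenation w·v·w, and obtains the suffix variant by reversing both words. The names `z_function`, `lp_general` and `ls_general` are hypothetical, and the returned lists are 0-based (entry i - 1 holds the value at position i).

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def lp_general(v, w):
    """LP_{v|w}: for each position of v, the longest factor of vw starting
    there that is a prefix of w (it may overlap into w)."""
    m = len(w)
    z = z_function(w + v + w)
    # cap at m so that only a prefix of w (not of w+v+w) is counted
    return [min(z[m + i], m) for i in range(len(v))]


def ls_general(w, v):
    """LS_{w|v}: for each position i of v, the longest factor of wv ending at
    position m + i that is a suffix of w. Obtained by mirror symmetry."""
    return lp_general(v[::-1], w[::-1])[::-1]


print(lp_general("xa", "aa"))     # the match at position 2 overlaps into w
print(ls_general("ab", "abab"))
```

Reversal turns "ends at position m + i and is a suffix of w" into "starts at position n - i + 1 and is a prefix of the reversed w", which is why a single prefix routine suffices.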


8.3.2. s-factorization and Lempel-Ziv factorization 


The second basic algorithmic tool is a special factorization of the word, that will
allow us to speed up our repetition-finding algorithms. There are several variants
of this factorization that we will use in the following sections. The difference is
of technical nature and the choice of the definition will be basically guided by the
convenience in describing the algorithm. We distinguish two factorizations that
we call s-factorization and Lempel-Ziv factorization, following the terms under
which they have been introduced in the literature. Each of those factorizations
comes in two variants, thus yielding four different definitions.

Let w be an arbitrary word. The s-factorization of w (respectively,
s-factorization with non-overlapping copies) is the factorization
w = f_1 f_2 ⋯ f_k, where the f_i's are defined inductively as follows:

• f_1 = w[1], and if the letter a occurring in w immediately after
  f_1 f_2 ⋯ f_{i-1} does not occur in f_1 f_2 ⋯ f_{i-1}, then f_i = a,

• otherwise, f_i is the longest factor occurring in w immediately after
  f_1 f_2 ⋯ f_{i-1} that occurs in f_1 f_2 ⋯ f_{i-1} f_i other than as a suffix
  (respectively, that occurs in f_1 f_2 ⋯ f_{i-1}).

The Lempel-Ziv factorization of w (respectively, Lempel-Ziv factorization
with non-overlapping copies) is the factorization w = f_1 f_2 ⋯ f_k, where the
f_i's are defined inductively as follows:

• f_1 = w[1],

• for i ≥ 2, f_i is the shortest factor occurring in w immediately after
  f_1 f_2 ⋯ f_{i-1} that does not occur in f_1 f_2 ⋯ f_{i-1} f_i other than as
  a suffix (respectively, that does not occur in f_1 f_2 ⋯ f_{i-1} at all).

In the s-factorization of the word w, we look for the longest factor f_i starting
after f_1 f_2 ⋯ f_{i-1} which has a copy on the left. In the s-factorization with
non-overlapping copies, this copy is required to be non-overlapping with f_i. In
the Lempel-Ziv factorization, we look for the shortest word that does not have an
occurrence on the left. In other words, we extend by one letter the longest
word which does have a copy on the left. If the Lempel-Ziv factorization with
non-overlapping copies is considered, then the copy is required not to overlap
with it.

As an example, consider the word w = 1100101010000. Its s-factorization
is 1|1|0|0|10|1010|000, and the s-factorization with non-overlapping copies of w
is 1|1|0|0|10|10|100|00. The Lempel-Ziv factorization of w is 1|10|01|010100|00
and the Lempel-Ziv factorization without copy overlap is 1|10|01|010|1000|0.

If w = f_1 f_2 ⋯ f_k is the s-factorization (respectively, Lempel-Ziv
factorization), we call each f_i an s-factor (respectively, LZ-factor) of w.
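The four definitions can be checked on the example word with a direct, deliberately naive implementation. The sketch below searches for earlier copies with Python's `str.find`, so it runs in quadratic time or worse and is meant only to make the definitions concrete; it is not the linear-time construction discussed in the text. All function names are hypothetical.

```python
def has_earlier_copy(w, pos, length, overlap):
    """True if w[pos:pos+length] also occurs starting before pos.
    With overlap=False the earlier copy must end before pos."""
    factor = w[pos:pos + length]
    # Any occurrence inside w[0:limit] necessarily starts before pos.
    limit = pos + length - 1 if overlap else pos
    return w.find(factor, 0, limit) != -1


def s_factorization(w, overlap=True):
    factors, pos, n = [], 0, len(w)
    while pos < n:
        if w[pos] not in w[:pos]:          # new letter: factor of length 1
            length = 1
        else:                              # longest factor with a copy on the left
            length = 1
            while pos + length < n and has_earlier_copy(w, pos, length + 1, overlap):
                length += 1
        factors.append(w[pos:pos + length])
        pos += length
    return factors


def lz_factorization(w, overlap=True):
    factors, pos, n = [], 0, len(w)
    while pos < n:
        length = 0
        while pos + length < n and has_earlier_copy(w, pos, length + 1, overlap):
            length += 1
        # extend the longest copied word by one letter (if any letter is left)
        length = min(length + 1, n - pos)
        factors.append(w[pos:pos + length])
        pos += length
    return factors


w = "1100101010000"
print("|".join(s_factorization(w)))
print("|".join(s_factorization(w, overlap=False)))
print("|".join(lz_factorization(w)))
print("|".join(lz_factorization(w, overlap=False)))
```

The four printed lines reproduce the four factorizations of the example above.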

A remarkable feature of all considered factorizations is that each of them
can be computed in time linear in the length of the word. This can be
done in different ways, using a data structure like the suffix tree or the DAWG
(Directed Acyclic Word Graph). A possible algorithm consists in computing the
factorization along with constructing the data structure in an on-line fashion. A
factorization provides very useful information about the structure of repeated
factors, and the possibility to compute the factorization in linear time makes
it a key algorithmic tool that we will use throughout the rest of this chapter.


8.4. Finding all maximal repetitions in a word 


According to the results of Section 8.2, maximal repetitions are important struc- 
tures, as they encode, in a most compact way, all repetitions occurring in the 
word. If the set of maximal repetitions is known, repetitions of any other type 
can be extracted from it: primitively- or non-primitively rooted squares, cubes, 
etc. 

In this section, we show that the set of all maximal repetitions can be com- 
puted very efficiently, namely in a time linear in the length of the word. The 
linear time bound is supported by Theorems 8.2.7, 8.2.8 that guarantee that 
the output itself is of linear size (assuming that each maximal repetition is rep- 
resented in constant space, e.g. by the start position, the period and the total 
length). 

We first consider the following auxiliary problem. Assume we are given two
words x = x[1..m], y = y[1..n], and consider their concatenation v = xy =
v[1..m + n]. We want to find all maximal repetitions r = v[i..j] in the word
v that contain the frontier between x and y, i.e. such that i ≤ m + 1 and
j ≥ m. Every such repetition belongs (non-exclusively) to one of two classes:
the repetitions which have a root in y and those which have a root in x. Note
that by Corollary 8.1.4, for every p, 1 ≤ p ≤ n, there is at most one maximal
repetition with a period p that contains the frontier between x and y and has a
root in y. This shows, in particular, that the number of such repetitions is not
greater than n. Similarly, the number of repetitions that contain the frontier
and have a root in x is not greater than m, and thus, the number of maximal
repetitions in v = xy which contain the frontier between x and y is bounded by
(m + n).

Figure 8.3. Illustration to Theorem 8.4.1

Let us focus on maximal repetitions r which have a root in y; those which
have a root in x are found similarly. Consider longest extension functions LP_y
and LS_{x|y} (see Section 8.3). The following theorem holds.


THEOREM 8.4.1. For 1 ≤ p ≤ n, there exists a maximal repetition with a
period p in v = xy that contains the frontier between x and y and has a root in
y iff

    LS_{x|y}(p) + LP_y(p + 1) ≥ p.                                  (8.4.1)

If the inequality holds, this maximal repetition is
v[m - LS_{x|y}(p) + 1..m + p + LP_y(p + 1)] (see Figure 8.3).


Proof. Assume there is a square s = v[ℓ..ℓ + 2p - 1] of period p that contains
the frontier between x and y, i.e. such that ℓ ≤ m + 1 and ℓ + 2p - 1 ≥ m.
Assume that s has a root in y, i.e. ℓ + p > m. Observe that prefix
y[1..ℓ + p - m - 1] of y is equal to y[p + 1..ℓ + 2p - m - 1]. This implies that
LP_y(p + 1) ≥ ℓ + p - m - 1. On the other hand, suffix x[ℓ..m] of x is equal to
v[ℓ + p..m + p], and then LS_{x|y}(p) ≥ m - ℓ + 1. Inequality (8.4.1) follows.

On the other hand, if (8.4.1) holds, then any factor v[ℓ..ℓ + 2p - 1] with
m - LS_{x|y}(p) + 1 ≤ ℓ ≤ m + LP_y(p + 1) - p + 1 is a square of period p.
Therefore, r = v[m - LS_{x|y}(p) + 1..m + p + LP_y(p + 1)] is a repetition. It
remains to see that r is maximal, i.e. extended to the right and left as far as
possible. This follows from the definition of the longest extension functions LS
and LP. □

Theorem 8.4.1 yields an algorithm for computing all maximal repetitions
which contain the frontier between x and y and have a root in y. The algorithm,
called RIGHT-REPETITIONS, is shown below. It assumes that each repetition is
represented by a pair of its start and end positions.


RIGHT-REPETITIONS(x, y)
    LP_y ← LONGEST-PREFIX-EXTENSION(y)
    LS_{x|y} ← LONGEST-SUFFIX-EXTENSION(x, y)
    R ← ∅
    for p ← 1 to |y| do
        if LS_{x|y}(p) + LP_y(p + 1) ≥ p then
            r ← (m - LS_{x|y}(p) + 1, m + p + LP_y(p + 1))
            R ← R ∪ {r}
    return R
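To see Theorem 8.4.1 at work, RIGHT-REPETITIONS can be sketched in Python as follows. For brevity the two extension functions are computed naively, letter by letter, inside the helper closures `lp` and `ls` (names chosen here), so this version is quadratic in the worst case; replacing them with linear-time routines would give the bound stated in the text. Positions are 1-based, as in the pseudocode.

```python
def right_repetitions(x, y):
    """Maximal repetitions of v = xy that contain the frontier between x and y
    and have a root in y, as a set of (start, end) pairs (1-based)."""
    m, n = len(x), len(y)
    v = x + y

    def lp(i):
        # LP_y(i): longest factor of y starting at i that is a prefix of y;
        # returns 0 for i = n + 1, matching the convention of the text.
        k = 0
        while i + k <= n and y[i - 1 + k] == y[k]:
            k += 1
        return k

    def ls(i):
        # LS_{x|y}(i): longest factor of v ending at position m + i
        # that is a suffix of x.
        k = 0
        while k < m and v[m + i - 1 - k] == x[m - 1 - k]:
            k += 1
        return k

    result = set()
    for p in range(1, n + 1):
        left, right = ls(p), lp(p + 1)
        if left + right >= p:                      # inequality (8.4.1)
            result.add((m - left + 1, m + p + right))
    return result


print(right_repetitions("ab", "abab"))   # the run ababab spanning the frontier
print(right_repetitions("xa", "ab"))     # the square aa across the frontier
```

Since the result is a set of (start, end) pairs, a repetition reported for several multiples of its smallest period appears only once.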


The repetitions containing the frontier between x and y and having a root in
x are computed similarly (function LEFT-REPETITIONS hereafter). As the
functions LONGEST-PREFIX-EXTENSION(y) and LONGEST-SUFFIX-EXTENSION(x, y)
run in time O(|y|) and O(|x| + |y|) respectively, all maximal repetitions in
v = xy which contain the frontier between x and y can be computed in time O(|v|).


Let us now come back to the problem of computing all maximal repetitions 
in a word. The s-factorization is another useful tool for building a linear-time 
algorithm for this problem, due to the following theorem. 


THEOREM 8.4.2. Let w = f_1 f_2 ⋯ f_k be the s-factorization of w. Let r be
a maximal repetition in w that contains the frontier between f_{i-1} and f_i and
ends inside f_i. Then the prefix of r which is a suffix of f_1 ⋯ f_{i-1} is
shorter than |f_i| + 2|f_{i-1}|.


Proof. Assume r = w[ℓ..m] and denote by b_i the position of the last letter of
f_i, i.e. b_i = |f_1 f_2 ⋯ f_i|. The theorem asserts that if ℓ ≤ b_{i-1} + 1 and
b_{i-1} + 1 ≤ m ≤ b_i, then b_{i-1} + 1 - ℓ < |f_i| + 2|f_{i-1}|.

Consider the suffix cyclic root r' = w[m - p(r) + 1..m] of r. Observe that r'
has a copy p(r) letters to the left. If r' starts at a position before the start of
f_{i-1}, i.e. m - p(r) < b_{i-2}, then it includes entirely the factor f_{i-1} and
at least the first letter of f_i. This contradicts that f_{i-1} is the longest
factor occurring on the left, according to the definition of s-factorization.
Therefore, m - p(r) ≥ b_{i-2} and then p(r) = |r'| ≤ |f_{i-1}| + |f_i|.

On the other hand, r cannot extend to the left of b_{i-2} + 1 by p(r) letters or
more, as this would again contradict the definition of f_{i-1}. Thus, the part of r
before the start of f_{i-1} is bounded by |f_{i-1}| + |f_i|. The theorem follows. □


For simplicity of presentation, we assume that the last letter of w does
not occur elsewhere in w. Given the s-factorization w = f_1 f_2 ⋯ f_k, we
consider, for each s-factor f_i, i ∈ [2..k], all maximal repetitions that end
either at the last position of f_{i-1}, or at some position of f_i except the last
one. Formally, these are repetitions r = w[ℓ..m] such that b_{i-1} ≤ m < b_i,
where b_i = |f_1 f_2 ⋯ f_i|. Clearly, each maximal repetition of w belongs to
exactly one such class. Note that since the last letter of w is unique, the last
s-factor f_k consists of one letter and there are no maximal repetitions that
occur as suffixes of w.

We further split all considered repetitions into two sets that we call repeti- 
tions of type 1 and 2: 


repetitions of type 1: maximal repetitions r that contain the frontier between
f_{i-1} and f_i and end at a position strictly smaller than the end of f_i,


repetitions of type 2: maximal repetitions r that occur properly inside f_i.


Repetitions of type 1 are repetitions r = w[ℓ..m] such that either m = b_{i-1},
or b_{i-1} + 1 ≤ m ≤ b_i - 1 and ℓ ≤ b_{i-1} + 1. By Theorem 8.4.2, the former
cannot extend by more than |f_{i-1}| + 2|f_{i-2}| to the left of f_{i-1}, and
therefore its length is bounded by 2|f_{i-1}| + 2|f_{i-2}|, and the latter cannot
extend by more than |f_i| + 2|f_{i-1}| to the left of f_i. Joining both cases
together, a repetition of type 1 cannot extend by more than
max{2|f_{i-1}| + 2|f_{i-2}|, |f_i| + 2|f_{i-1}|} =
2|f_{i-1}| + max{2|f_{i-2}|, |f_i|} to the left of f_i.

Therefore, to find all repetitions of type 1, we consider the word t_i f_i, where
t_i is the suffix of f_1 ⋯ f_{i-1} of length 2|f_{i-1}| + max{2|f_{i-2}|, |f_i|}
(t_i is the whole word f_1 ⋯ f_{i-1} if its length is smaller than
2|f_{i-1}| + max{2|f_{i-2}|, |f_i|}). We then have to find in t_i f_i all maximal
repetitions that contain the frontier between t_i and f_i and do not include the
last letter of f_i. This can be done in time O(|f_{i-2}| + |f_{i-1}| + |f_i|)
using longest extension functions, as described above. Summing up over all
s-factors, all repetitions of type 1 in w can be found in time O(n).

Every repetition of type 2 occurs entirely inside some s-factor f; of the 
s-factorization, and each f; has an earlier occurrence in w. Therefore, each 
maximal repetition of type 2 has another occurrence on the left. This implies, 
in particular, that finding all repetitions of type 1 guarantees finding all dis- 
tinct maximal repetitions, and in particular all leftmost occurrences of distinct 
maximal repetitions. 

We are left with the problem of finding all repetitions of type 2. Here is how 
this can be done. 

During the computation of the s-factorization we store, for each s-factor f_i,
a pointer to an earlier occurrence of f_i in w. Computing such a pointer does
not affect the linear time complexity of s-factorization. Let v_i be this earlier
occurrence of f_i, and let Δ_i be the difference between the position of f_i and
the position of v_i. Obviously, each repetition of type 2 occurring inside f_i is
a copy of a maximal repetition occurring inside v_i shifted by Δ_i to the right.

We first sort, using bucket sort, all maximal repetitions of type 1, found
at the first stage, into n lists end[1],...,end[n] such that list end[j] contains all
maximal repetitions with end position j. Then we process all lists end[j] in
increasing order of j and sort the repetitions again, using bucket sort, into n
lists start[1],...,start[n] according to their start position. After this double
sorting, the repetitions with the same start position j are sorted inside the list
start[j] in increasing order of their end positions. As there is a linear number
of repetitions of type 1, both sorting procedures take linear time.
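The double bucket sort is short enough to spell out. In the sketch below (the function name and the representation of a repetition as a 1-based `(start, end)` pair are assumptions made for illustration), the first pass distributes repetitions by end position and the second pass, scanning ends in increasing order, distributes them by start position.

```python
def double_bucket_sort(repetitions, n):
    """Return lists start[1..n] (index 0 unused) such that start[j] holds the
    repetitions beginning at j, ordered by increasing end position."""
    end = [[] for _ in range(n + 1)]
    for rep in repetitions:
        end[rep[1]].append(rep)            # bucket by end position
    start = [[] for _ in range(n + 1)]
    for e in range(1, n + 1):              # increasing end position
        for rep in end[e]:
            start[rep[0]].append(rep)      # bucket by start position
    return start


start = double_bucket_sort([(2, 8), (3, 4), (2, 5), (3, 9)], 9)
print(start[2], start[3])
```

Because the second pass is stable with respect to the order produced by the first, each list comes out sorted by end position without any comparison-based sorting.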

We will use the same lists start[j] to store maximal repetitions of type 2.
For each s-factor f_i and for each internal position j inside f_i, we have to find
all maximal repetitions starting at this position and ending strictly inside f_i.
To this end, we find all maximal repetitions from the list start[j - Δ_i] which
end inside v_i, and then shift them by Δ_i to the right. Note that these
repetitions may be either of type 1, or previously found repetitions of type 2.
We look through the list start[j - Δ_i] and retrieve its prefix consisting of those
maximal repetitions which end inside v_i. Then we shift each of these maximal
repetitions by Δ_i and prepend the shifted copy of this prefix to the list start[j].
Note that the data structure is preserved, as all added repetitions must end
before any of the repetitions of type 1 previously stored in the list start[j].
Since we process the f_i's from left to right, no maximal repetition can be missed.
Thus, we recover all repetitions of type 2 and after all f_i's have been processed,
the data structure contains all maximal repetitions of both types.

Note that when we retrieve a prefix of the list corresponding to some position
in v_i, each repetition in this prefix results in a new maximal repetition of type
2 in f_i. This shows that the time spent on processing the lists is proportional
to the number of newly found maximal repetitions. Theorem 8.2.7 states that
the number of all maximal repetitions is linear in the length of w. This proves
that the whole algorithm takes linear time.

The whole algorithm for computing all maximal repetitions is summarized 
below. 


MAXIMAL-REPETITIONS(w[1..n])
    (f_1, ..., f_k) ← s-factorization of w
    ▷ first stage
    R ← ∅
    for i ← 1 to k do
        t_i ← suffix of f_1 ⋯ f_{i-1} of length 2|f_{i-1}| + max{2|f_{i-2}|, |f_i|}
        R'_i ← RIGHT-REPETITIONS(t_i, f_i)
        R''_i ← LEFT-REPETITIONS(t_i, f_i)
        R ← R ∪ R'_i ∪ R''_i
    ▷ second stage
    for each r = (j, ℓ) ∈ R do
        add r to list end[ℓ]
    for ℓ ← 1 to n do
        for each r = (j, ℓ) from list end[ℓ] do
            add r to list start[j]
    for i ← 1 to k do
        for j ← b_{i-1} + 1 to b_i do
            for each r from list start[j - Δ_i] such that |r| < b_i - j + 1 do
                add repetition r' = (j, j + |r| - 1) to list start[j]
                R ← R ∪ {r'}
    return R

The complexity of MAXIMAL-REPETITIONS follows from the above discussion:


THEOREM 8.4.3. MAXIMAL-REPETITIONS finds all maximal repetitions in a 
word of length n in time O(n). 
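When experimenting with an implementation of MAXIMAL-REPETITIONS, it helps to have an independent reference to compare against. The sketch below is such a reference, assuming nothing beyond the definition of a maximal repetition: for every candidate period p it scans the maximal stretches where w[i] = w[i + p]. It runs in quadratic time and is not the linear-time algorithm of this section. The function name is hypothetical and positions are 1-based.

```python
def maximal_repetitions_bruteforce(w):
    """All maximal repetitions of w as sorted (start, end) pairs, 1-based."""
    n = len(w)
    runs = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            # maximal stretch i..j-1 (0-based) on which w[t] == w[t + p]
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            # the factor w[i..j+p-1] has period p and cannot be extended;
            # it is a repetition when its length j - i + p is at least 2p
            if j - i >= p:
                runs.add((i + 1, j + p))
            i = j + 1
    return sorted(runs)


print(maximal_repetitions_bruteforce("mississippi"))
```

A stretch found for a multiple of the smallest period yields the same interval as the smallest period itself, so collecting intervals in a set removes such duplicates.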


The set of all maximal repetitions provides exhaustive information about
the repetition structure of the word. It allows us to extract all repetitions of
other types, such as (primitively- or non-primitively-rooted) squares, cubes, or
integer powers. Thus, all these tasks can be done in time O(n + T) where T is the
output size.

As another application, the set of maximal repetitions allows us to determine,
in linear time, the number d_k(i) of primitively-rooted integer powers of a given
exponent k, starting at each position i of the word. Here is how this can be done.
For each position i ∈ [1..n] of the input word w, we create two counters b(i)
and c(i), initially set to 0. For each repetition r = w[ℓ..m] of exponent at least
k, we increment b(ℓ) and c(m - kp(r) + 1) by 1 ([ℓ..m - kp(r) + 1] is the
interval where primitively-rooted k-powers induced by repetition r start). By
Theorem 8.2.8, the number of updates is linear. To compute the numbers d_k(i),
we scan all positions from left to right applying the following iterative
procedure: d_k(1) = b(1), d_k(i) = d_k(i - 1) + b(i) - c(i - 1), i = 2..|w|.
Note that the algorithm can be extended to all (not necessarily
primitively-rooted) k-powers. In this case, we increment b(ℓ) by ⌊e(r)/k⌋, and
we increment by 1 each c(j), for j = m - kp(r) + 1, m - 2kp(r) + 1, ...,
m - ⌊e(r)/k⌋kp(r) + 1. Here, Theorem 8.2.7 guarantees that the number of
updates is linear. Finally, note that the procedure can be easily modified in
order to count fractional repetitions of a given exponent, as well as
repetitions ending (or centered) at each position.
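The counter technique for primitively-rooted k-powers can be sketched as follows. The input format is an assumption made here: each maximal repetition is given as a triple `(start, end, period)` with 1-based positions and its smallest period; the function name is hypothetical.

```python
def count_primitive_powers(n, repetitions, k):
    """d[i-1] = number of primitively-rooted k-powers starting at position i,
    given the maximal repetitions of a word of length n."""
    b = [0] * (n + 2)
    c = [0] * (n + 2)
    for start, end, period in repetitions:
        last = end - k * period + 1        # last start of a k-power inside r
        if last >= start:                  # r is long enough to hold a k-power
            b[start] += 1
            c[last] += 1
    d = [0] * (n + 1)                      # d[0] = 0 is a sentinel
    for i in range(1, n + 1):
        d[i] = d[i - 1] + b[i] - c[i - 1]  # intervals opened minus intervals closed
    return d[1:]


# maximal repetitions of "mississippi": ississi (period 3), ss, ss, pp
runs = [(2, 8, 3), (3, 4, 1), (6, 7, 1), (9, 10, 1)]
print(count_primitive_powers(11, runs, 2))
```

Each repetition contributes one interval of starting positions, and the left-to-right scan is a running count of the intervals covering the current position.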


8.5. Finding quasi-squares in two words 


The linear-time algorithm for computing all maximal repetitions presented in
the previous section allows us to compute all squares in a word of length n in
time O(n + S), where S is the number of those squares. In this section we
consider the problem of finding quasi-squares, which generalizes the problem of
computing usual squares. Besides being interesting in its own right, the problem
of quasi-squares will be used later in Section 8.6.1.

Assume we are given two words u, v of equal length, |u| = |v| = n, n ≥ 2. We
say that words u, v contain a quasi-square iff for some 1 ≤ ℓ ≤ n and p > 0, we
have u[ℓ..ℓ + p - 1] = v[ℓ + p..ℓ + 2p - 1]. The number p is called the period of
the quasi-square, and words u[ℓ..ℓ + p - 1], v[ℓ + p..ℓ + 2p - 1] are called
respectively its left root and right root. Obviously, if u = v, then the
quasi-squares are usual squares.

Denote by QS(u, v) the set of all quasi-squares of words u, v. We show that
QS(u, v) can be computed in time O(n log n + S), where S = Card(QS(u, v)).
The algorithm we propose is based only on longest extension functions and is
similar to the algorithm based on Theorem 8.4.1 for finding maximal repetitions
containing a given position. An advantage of the proposed solution is that
the output quasi-squares are naturally grouped into runs of quasi-squares of
the same period starting at successive positions in the word. These runs are
analogous to repetitions for usual non-gapped squares. We will use this feature
of the algorithm later in Section 8.6.

Assume n = 2m, and denote by QS_m(u, v) the subset of QS(u, v) consisting
of those quasi-squares (u[ℓ..ℓ + p - 1], v[ℓ + p..ℓ + 2p - 1]) which contain the
frontier in the middle of u and of v, more precisely such that ℓ ≤ m + 1 and
ℓ + 2p - 1 ≥ m. QS_m(u, v) is the union of two subsets QS_m^l(u, v) and
QS_m^r(u, v) consisting of quasi-squares containing the middle frontier in their
left root or right root respectively. Consider the set QS_m^l(u, v)
(QS_m^r(u, v) is treated similarly). QS_m^l(u, v) consists of quasi-squares
(u[ℓ..ℓ + p - 1], v[ℓ + p..ℓ + 2p - 1]) verifying m + 1 - p ≤ ℓ ≤ m + 1.

Consider the following longest extension functions defined on all positions
i ∈ [1..m] of v[m + 1..n]:

    LP(i) = min{ LP_{v[m+1..n] | u[m+1..n]}(i), m - i + 1 }
    LS(i) = min{ LS_{u[1..m] | v[m+1..n]}(i), i }

In words, LP(i) is the length of the longest common prefix of v[m + i..n] and
u[m + 1..n], and LS(i) is the length of the longest common suffix of u[1..m] and
v[m + 1..m + i].

Let u[ℓ..ℓ + p - 1], v[ℓ + p..ℓ + 2p - 1] be a quasi-square of period p belonging
to QS_m^l(u, v). By an argument similar to Theorem 8.4.1, we have

    LS(p) + LP(p + 1) ≥ p.                                          (8.5.1)

Vice versa, if for some p ∈ [1..m], inequality (8.5.1) holds, then there exists a
quasi-square of period p from QS_m^l(u, v). More precisely, the following lemma
holds.


LEMMA 8.5.1. For p ∈ [1..m], there exists a quasi-square of QS_m^l(u, v) of
period p iff inequality (8.5.1) holds. When (8.5.1) holds, there is a family of
quasi-squares of period p from QS_m^l(u, v), with the left roots starting at each
position of the interval

    [m + 1 - LS(p) .. m + 1 + min{LP(p + 1) - p, 0}].               (8.5.2)


To use Lemma 8.5.1 as an algorithm for computing QS_m^l(u, v), we have to
compute the values LP(i), LS(i) for i ∈ [1..m]. All these values can be computed
in time O(m), as explained in Section 8.3.1.
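Lemma 8.5.1 translates into a short procedure. In the sketch below, LP and LS are taken to be the longest common prefix of v[m+i..n] with u[m+1..n] and the longest common suffix of u[1..m] with v[m+1..m+i], which is the reading under which inequality (8.5.1) characterizes the equality u[ℓ..ℓ+p-1] = v[ℓ+p..ℓ+2p-1]. Both are computed naively here, so the sketch is quadratic rather than O(m); the names are hypothetical, the words are assumed to have the same even length, and a brute-force enumerator is included only for cross-checking.

```python
def quasi_squares_left_middle(u, v):
    """Quasi-squares (l, p), 1-based start l of the left root and period p,
    whose left root contains (or touches) the middle frontier of u."""
    n = len(u)
    m = n // 2

    def lp(i):   # common prefix of v[m+i..n] and u[m+1..n]; 0 when i = m + 1
        k = 0
        while m + i + k <= n and v[m + i + k - 1] == u[m + k]:
            k += 1
        return k

    def ls(i):   # common suffix of u[1..m] and v[m+1..m+i]
        k = 0
        while k < i and u[m - 1 - k] == v[m + i - 1 - k]:
            k += 1
        return k

    found = []
    for p in range(1, m + 1):
        left, right = ls(p), lp(p + 1)
        if left + right >= p:                               # inequality (8.5.1)
            lo = m + 1 - left
            hi = m + 1 + min(right - p, 0)                  # interval (8.5.2)
            found.extend((l, p) for l in range(lo, hi + 1))
    return found


def quasi_squares_left_middle_bruteforce(u, v):
    n = len(u)
    m = n // 2
    return [(l, p)
            for p in range(1, m + 1)
            for l in range(max(1, m + 1 - p), m + 2)
            if l + 2 * p - 1 <= n and u[l - 1:l - 1 + p] == v[l + p - 1:l + 2 * p - 1]]


print(quasi_squares_left_middle("abab", "baba"))
```

Each period contributes one contiguous interval of start positions, which is the grouping into runs mentioned earlier in this section.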

We conclude that all quasi-squares of QS_m^l(u, v) can be computed in time
O(m + Card(QS_m^l(u, v))). Similarly, all quasi-squares of QS_m^r(u, v) can be
computed in time O(m + Card(QS_m^r(u, v))), and thus all quasi-squares of
QS_m(u, v) are computed in time O(m + Card(QS_m(u, v))). A straightforward
divide-and-conquer algorithm gives the running time O(n log n + Card(QS(u, v)))
for finding all quasi-squares in u, v.

THEOREM 8.5.2. The set QS(u, v) of all quasi-squares in words u, v of length
n can be found in time O(n log n + Card(QS(u, v))).


8.6. Finding repeats with a fixed gap 


In this section, we are interested in the problem of computing all factors of a
given word repeated within a specified distance, rather than contiguously. In
other words, we want to find all factor occurrences uvu, where the size of v,
called the gap, is equal to a pre-specified constant δ. We will show that using
the algorithm of the previous sections, all such occurrences can be found in time
O(n log δ + S), where S is the number of them. Thus, if δ is considered constant,
we obtain an O(n + S) time bound.


8.6.1. Algorithm for finding repeats with a fixed gap 


Let δ > 0 be an integer, called the gap. An occurrence in w of a factor r = uvu,
where |u| > 0 and |v| = δ, is called a δ-gapped repeat (for short, δ-repeat) in w.
The left occurrence of u is called the left copy of r, and the right one the right
copy. For a δ-repeat r, the length |u| is called the copy length. The problem is
to find all δ-repeats in a given word.
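The definition can be made concrete with a brute-force enumerator, useful as a reference when testing faster methods. The sketch below simply tries every copy length and every start position; it is not the algorithm developed in this section, and the function name and the `(start, copy_length)` output format (1-based start) are choices made here.

```python
def gapped_repeats_bruteforce(w, delta):
    """All delta-repeats of w as (start, copy_length) pairs, 1-based start."""
    n = len(w)
    found = []
    for p in range(1, (n - delta) // 2 + 1):            # copy length
        for s in range(0, n - 2 * p - delta + 1):       # 0-based start of left copy
            if w[s:s + p] == w[s + p + delta:s + 2 * p + delta]:
                found.append((s + 1, p))
    return found


print(gapped_repeats_bruteforce("abcab", 1))   # ab, gap c, ab
```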

Let w = a_1 ⋯ a_n be a word of length n. Without loss of generality, we
assume that a_n does not occur elsewhere in w. In this section, we use the
Lempel-Ziv factorization (see Section 8.3.2). Let w = f_1 ⋯ f_k be the
Lempel-Ziv factorization of w.

To describe the algorithm, we need some notation. Let b_0, b_1, ..., b_{k-1}, b_k
be the end positions of the f_i's, that is b_0 = 0, and b_i = |f_1 ⋯ f_i| for
1 ≤ i ≤ k. We also denote ℓ_i = |f_i|, i = 1,...,k. For every i = 1,...,k - 1,
if ℓ_i > δ, then we decompose f_i = f'_i f''_i, where |f'_i| = δ.

Let us split the set GR of all δ-repeats into the set GR' of those δ-repeats
which contain a frontier between LZ-factors and the set GR'' of the δ-repeats
located properly inside LZ-factors. We now concentrate on the δ-repeats of GR',
and further split GR' into (disjoint) subsets GR'_i, i = 1,...,k - 1, where GR'_i
consists of those δ-repeats which contain the frontier between f_i and f_{i+1} but
do not contain the frontier between f_{i+1} and f_{i+2}. Furthermore, each GR'_i
is split into the following subsets, denoted here GR_i^{(a)}, ..., GR_i^{(d)}
after the case they correspond to:

(a) r ∈ GR_i^{(a)} iff the left copy of r contains the frontier between f_i and
    f_{i+1},
(b) r ∈ GR_i^{(b)} iff the right copy of r contains the frontier between f_i and
    f_{i+1},
(c) r ∈ GR_i^{(c)} iff ℓ_{i+1} > δ and the right copy of r contains the frontier
    between f'_{i+1} and f''_{i+1}, but does not contain the frontier between
    f_i and f_{i+1},
(d) r ∈ GR_i^{(d)} iff the right copy of r contains neither the frontier between
    f_i and f_{i+1}, nor, if ℓ_{i+1} > δ, the frontier between f'_{i+1} and
    f''_{i+1}.

Cases (a) and (b) cover the situation when the frontier between f_i and f_{i+1} is
contained respectively in the left and right copy of r. Otherwise, this frontier is
contained in the gap between the copies. Cases (c) and (d) distinguish whether
the right copy contains the frontier between f'_{i+1} and f''_{i+1} or not,
provided that ℓ_{i+1} > δ.

We now consider each of cases (a)-(d) separately and show how to find the
corresponding set of δ-repeats.


(a) Finding δ-repeats of GR_i^{(a)}. Let r be a δ-repeat with copy length p
belonging to GR_i^{(a)}, the subset of case (a). Since r does not contain the
frontier between f_{i+1} and f_{i+2}, then δ + p < ℓ_{i+1}, and therefore
p ≤ ℓ_{i+1} - 1 - δ. In particular, GR_i^{(a)} is empty whenever
ℓ_{i+1} ≤ δ + 1. Assume now that ℓ_{i+1} > δ + 1.

Let t_i be the suffix of f_1 ⋯ f_i of length ℓ_{i+1} - 1 - δ. We define the
following longest extension functions on all positions j ∈ [1..ℓ_{i+1} - 1] of
f_{i+1}:

    LP(j) = LP_{f_{i+1}[1..ℓ_{i+1} - 1]}(j)
    LS(j) = LS_{t_i | f_{i+1}[1..ℓ_{i+1} - 1]}(j)

Similarly to Theorem 8.4.1, an occurrence of a δ-repeat r ∈ GR_i^{(a)} implies

    LS(δ + p) + LP(δ + p + 1) ≥ p.                                  (8.6.1)

Conversely, if for some p ∈ [1..ℓ_{i+1} - 1 - δ], inequality (8.6.1) holds, then
there exists a δ-repeat of GR_i^{(a)} with copy length p. To summarize, the
following lemma holds.


LEMMA 8.6.1. For p ∈ [1..ℓ_{i+1} - 1 - δ], there exists a δ-repeat of
GR_i^{(a)} with copy length p iff inequality (8.6.1) holds. When (8.6.1) holds,
there is a family of δ-repeats of GR_i^{(a)} with copy length p, starting at each
position of the interval

    [b_i + 1 - min{LS(δ + p), p} .. b_i + 1 + min{LP(δ + p + 1) - p, 0}].   (8.6.2)


Lemma 8.6.1 gives a method of computing GR_i^{(a)}. Compute the longest
extension functions LS and LP. According to Section 8.3.1, this computation can
be done in time linear in the length of the involved words, that is in time
O(ℓ_{i+1}). Then, all δ-repeats of GR_i^{(a)} can be computed in time
O(ℓ_{i+1} + Card(GR_i^{(a)})).


(b) Finding δ-repeats of GR_i^{(b)}. Consider a δ-repeat r with copy length p
belonging to GR_i^{(b)}, the subset of case (b). From the definition of
Lempel-Ziv factorization, it follows that the right copy of r starts strictly
after the start of f_i. On the other hand, from the definition of GR'_i, it ends
strictly before the end of f_{i+1}. Therefore, p ≤ ℓ_i + ℓ_{i+1} - 2.

We then proceed similarly to case (a). Using appropriate longest extension
functions computed in time O(ℓ_i + ℓ_{i+1}), all δ-repeats of GR_i^{(b)} can be
reported in time O(ℓ_i + ℓ_{i+1} + Card(GR_i^{(b)})).


(c) Finding δ-repeats of GR_i^{(c)}. Note that this case is defined only when
ℓ_{i+1} > δ. Consider a δ-repeat r with copy length p belonging to GR_i^{(c)},
the subset of case (c). The right copy of r occurs inside
w[b_i + 2..b_i + ℓ_{i+1} - 1], and therefore p ≤ ℓ_{i+1} - 2.

Again, using appropriate longest extension functions, all δ-repeats of
GR_i^{(c)} can be reported in time O(ℓ_{i+1} + Card(GR_i^{(c)})).

(d) Finding δ-repeats of GR_i^{(d)}. Consider now a δ-repeat r with copy length
p belonging to GR_i^{(d)}, the subset of case (d). Denote m_i = min{δ, ℓ_{i+1}}.
The right copy of r occurs inside f_{i+1}[2..m_i - 1], and therefore
p ≤ m_i - 2.

This case differs from cases (a)-(c) in that we cannot a priori select a position
contained in the right or left copy of r. Therefore, we cannot apply directly the
technique of longest extension functions. We reduce this case to the problem of
finding quasi-squares, considered in Section 8.5.

Since the start position of the right copy belongs to the interval
[b_i + 2..b_i + m_i - 1], the end position of the left copy belongs to the
interval [b_i + 1 - δ..b_i + m_i - 2 - δ]. Since p ≤ m_i - 2, the left copy of r
is contained in the word w[b_i - δ - m_i + 4..b_i + m_i - 2 - δ].

Consider the word w' = w[b_i - δ - m_i + 4..b_i + m_i - 1 - δ]. The length
of w' is (2m_i - 4). Let # be another fresh letter. Denote by w'' the word
#^{m_i - 2} w[b_i + 2..b_i + m_i - 1].

LEMMA 8.6.2. There exists a δ-repeat r ∈ GR_i^{(d)} iff there exists a
quasi-square in words w', w''. Each such quasi-square corresponds to a δ-repeat
r ∈ GR_i^{(d)}.


Proof. Consider a δ-repeat r ∈ GR_i^{(d)} with the left copy w[ℓ..ℓ + p - 1] and
the right copy w[ℓ + δ + p..ℓ + δ + 2p - 1]. The right copy is a factor of
w[b_i + 2..b_i + m_i - 1], and therefore starts at position
(ℓ + δ + p - b_i + m_i - 3) in w''. The left copy starts at position
ℓ - (b_i - δ - m_i + 4) + 1 = ℓ + δ - b_i + m_i - 3 in w' and therefore this
forms a quasi-square of period p in w' and w''.

Conversely, assume there is a quasi-square w'[j..j + p - 1] =
w''[j + p..j + 2p - 1]. We must have [j + p..j + 2p - 1] ⊆ [m_i - 1..2m_i - 4].
This implies that w[b_i - δ - m_i + j + 3..b_i - δ - m_i + j + p + 2] =
w[b_i - m_i + j + p + 3..b_i - m_i + j + 2p + 2], and hence a δ-repeat starting
at position (b_i - δ - m_i + 3 + j) in w. □


In view of Theorem 8.5.2, all quasi-squares in w', w'' can be found in time
O(m_i log m_i). We conclude that all δ-repeats of GR_i^{(d)} can be reported in
time O(m_i log m_i + Card(GR_i^{(d)})). Using m_i = min{δ, ℓ_{i+1}}, rewrite
this bound as O(ℓ_{i+1} log δ + Card(GR_i^{(d)})).


Putting together cases (a)-(d), all δ-repeats of GR'_i can be found in time
O(ℓ_i) + O(ℓ_{i+1} log δ) + O(Card(GR'_i)). Summing up over all i = 1..k, we
obtain that all δ-repeats of GR' can be found in time O(n log δ + Card(GR')).

Finding δ-repeats of GR'' can be done using a technique similar to the second
stage of MAXIMAL-REPETITIONS from Section 8.4. The key observation here is
that each δ-repeat of GR'' occurs inside some factor f_{i+1} (i.e. does not
contain positions b_i + 1 and b_{i+1}). By definition of the factorization, each
such δ-repeat is a copy of another δ-repeat occurring to the left. When
constructing the Lempel-Ziv factorization, we store, for each factor f_i = va, a
reference to an earlier occurrence of v. The δ-repeats occurring in f_i are
located inside v, and are retrieved from its copy by the method used in the second
stage of MAXIMAL-REPETITIONS. The running time of this stage is
O(n + Card(GR'')).

We conclude with the final result of this section.


THEOREM 8.6.3. The set GR of all δ-repeats in a word of length n can be
found in time O(n log δ + Card(GR)).


8.6.2. Finding δ-repeats with fixed gap word


The algorithm of Section 8.6.1 can be modified in order to find all δ-repeats
with a fixed word between the two copies. Assume v is a fixed word of length δ.
Denote by GR_v the set of δ-repeats of the form uvu, where |u| ≥ 1. We show
that all those repeats can be found in time O(n log δ + Card(GR_v)). To do that,
we first find, using any linear-time string matching algorithm (for example, the
Knuth-Morris-Pratt algorithm), all start positions of occurrences of v in w. For
each position i of w, we compute the position next(i), defined as the closest
start position of v to the right of i.

From the algorithm of Section 8.6.1 for finding the set GR’, it should be 
clear that all the 6-repeats of GR’ can be represented by O(n log 6) families 
each consisting of 6-repeats with a given copy length and starting at all positions 
from a given interval. In other words, each family can be specified by an interval 
[i..7] and a number p, and encodes all 5-repeats with copy length p starting at 
positions from [i..j]. 

From this description, using function next(i), we can easily extract all
δ-repeats of GR_v in time proportional to the number of those. For that, we first
assume that each family is specified by the interval of start positions of the gap
between the left and right copies (as the copy length p is known for each family,
the translation can be trivially computed by just adding p to the interval of start
positions). Then we process all the families and extract from each interval those
positions which are start positions of an occurrence of v. Using function next,
this can be easily done in time proportional to the number of such positions.
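
The extraction step can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the helper names occurrence_starts, build_next and starts_in_interval are introduced here, positions are 0-based, next is taken to include the position itself, v is assumed non-empty, and a naive matcher stands in for a linear-time algorithm such as Knuth-Morris-Pratt.

```python
def occurrence_starts(w, v):
    """Start positions (0-based) of all occurrences of v in w.
    Naive matching for brevity; a linear-time matcher would be used in practice."""
    return [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]


def build_next(w, v):
    """nxt[i] = closest start position of v at or to the right of i, or None."""
    n = len(w)
    is_start = [False] * (n + 1)
    for i in occurrence_starts(w, v):
        is_start[i] = True
    nxt = [None] * (n + 1)
    closest = None
    for i in range(n, -1, -1):          # scan from right to left
        if is_start[i]:
            closest = i
        nxt[i] = closest
    return nxt


def starts_in_interval(nxt, lo, hi):
    """Start positions of v inside [lo..hi], in time proportional to their number."""
    found = []
    i = nxt[lo]
    while i is not None and i <= hi:
        found.append(i)
        i = nxt[i + 1]                  # jump directly to the following occurrence
    return found
```

Each family interval is then scanned with one call to starts_in_interval, so the work spent on a family is bounded by the number of reported positions plus a constant.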

After processing all families, we have found all δ-repeats from the set GR'_v =
GR_v ∩ GR' in time O(n log δ + Card(GR'_v)). Then, using the procedure for finding
δ-repeats from GR'', described in Section 8.6.1, we find all δ-repeats from GR''_v =
GR_v ∩ GR'' in time O(n + Card(GR''_v)). As GR_v = GR'_v ∪ GR''_v, all δ-repeats
from GR_v are found in time O(n log δ + Card(GR_v)).


8.7. Computing local periods of a word 


In this section we focus on the important notion of local periods, which
characterize the local periodic structure at each location of the word. The local
period at a given position is the root size of the smallest square centered at this
position. The importance of local periods is evidenced by the fundamental Critical
Factorization Theorem, which asserts that there exists a position in the word (and
a corresponding factorization) for which the local period is equal to the global
period of the word.

Consider a word w = a_1 ··· a_n over a finite alphabet. Let w = uv be a
factorization of w such that |u| = i. We say that a non-empty square xx is
centered at position i of w (or matches w at central position i) iff the following
conditions hold:

(i) x is a suffix of u, or u is a suffix of x, 

(ii) x is a prefix of v, or v is a prefix of x. 
In the case when x is a suffix of u and x is a prefix of v, we have a square
occurring inside w. We call it an internal square. If v is a proper prefix of 
x (respectively, u is a proper suffix of x), the square is called right-external 
(respectively, left-external). 

The smallest square centered at a position i of w is called the minimal local
square (hereafter simply minimal, for short). The local period at position
i of w, denoted MLP_w(i), is the period of the minimal square centered at this
position¹.

Note that for each position i of w, MLP_w(i) is well-defined, and 1 ≤
MLP_w(i) ≤ |w|. The relation between local periods and the (global) period
of the word is established by the fundamental Critical Factorization Theorem.


THEOREM 8.7.1 (Critical Factorization Theorem). For each word w, there
exists a position i (and the corresponding factorization w = uv, |u| = i) such that
MLP_w(i) = p(w). Moreover, such a position exists among any p(w) consecutive
positions of w.
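
The definitions above can be checked on small words with a brute-force sketch. The function names period, local_period and local_periods are introduced here for illustration; the code runs in quadratic time per position and is unrelated to the linear-time method developed in this section. It uses the observation that a square xx with |x| = p is centered at i exactly when w[j] = w[j + p] for every j such that both letters lie inside w and straddle the cut.

```python
def period(w):
    """Smallest (global) period p(w) of a non-empty word w."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[j] == w[j + p] for j in range(n - p)))


def local_period(w, i):
    """Local period at position i (1 <= i <= len(w) - 1), where w = uv and |u| = i.
    Indices j below are 0-based; j ranges over the letters of u that have a
    partner j + p inside v."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[j] == w[j + p] for j in range(max(0, i - p), min(i, n - p))):
            return p


def local_periods(w):
    """Local periods at positions 1, ..., len(w) - 1."""
    return [local_period(w, i) for i in range(1, len(w))]
```

For w = abaab this gives the local periods 2, 3, 1, 3 at positions 1 to 4, while p(w) = 3, so positions 2 and 4 are critical.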


In this section, we show how the techniques of the previous sections can be
used to compute all local periods in a word in time O(n), assuming a
constant-size alphabet. The method consists of two parts. We first show, in Section 8.7.1,
how to compute all internal minimal squares. Then, in Section 8.7.2 we show 
how to compute left- and right-external minimal squares, in particular for those 
positions for which no internal square has been found. Both computations will 
be shown to be linear-time, and therefore computing all local periods can be 
done within linear time too. 


8.7.1. Computing internal minimal squares 


Finding internal minimal squares amounts to computing, for each position of the
word, the smallest square that is centered at this position and occurs entirely 
inside the word, provided that such a square exists. Thus, throughout this 
section we will be considering only squares occurring inside the word and, for 
the sake of brevity, omit the adjective “internal”. 

The general approach is to use the algorithm for computing maximal repe- 
titions from Section 8.4 in order to retrieve squares which are minimal for some 
position. One modification of MAXIMAL-REPETITIONS we make here is that we 
use the s-factorization with non-overlapping copies (see Section 8.3.2) instead 
of the regular s-factorization, used in Section 8.4. This modification, however, 
does not affect any properties of MAXIMAL-REPETITIONS, including its linear 
time bound. 


¹ Note that the period of a square xx is |x| and not the minimal period of the word xx, which
can be smaller.

Let us now focus on the first stage of MAXIMAL-REPETITIONS. At this step,
we find, for each s-factor f_j = w[b_{j-1}+1..b_j], all maximal repetitions that start
before b_{j-1} and end inside f_j. According to Corollary 8.1.4, for each possible
period p, there can be at most two such repetitions. Each maximal repetition is
a run of squares occurring at successive positions in the word. For our purpose
here, it will be convenient to think of a repetition as an interval of center
positions of the squares it contains.

In this section, we will need to compute, for a maximal repetition, a subrun
of squares it contains, that is a subinterval of the corresponding interval of center
positions. To illustrate this, consider Theorem 8.4.1. The maximal repetition
found according to the theorem corresponds to squares centered at positions
[m + p − LS_{u|v}(p) .. m + LP_v(p + 1)]. If we want to compute only squares
centered at positions greater than or equal to m and starting at positions less
than or equal to m (as will be the case below), the interval of centers should
be restricted to [m + max{p − LS_{u|v}(p), 0} .. m + min{LP_v(p + 1), p}]. A
similar interval restriction has been done in the previous section for computing
δ-repeats.

We now present a linear-time algorithm for computing all internal minimal
squares in a given word w. The general description of the algorithm is as follows.
First, we compute, in linear time, the s-factorization of w with non-overlapping
copies and keep, for each factor f_j, a reference to its non-overlapping left copy.
Then we process all factors from left to right and compute, for each factor f_j,
all minimal squares ending in this factor. For each computed minimal square,
centered at position i, the corresponding value MLP_w(i) is set. After the whole
word has been processed, positions i for which values MLP_w(i) have not been
assigned are those for which no internal square centered at i exists. For those
positions, minimal squares are external, and they will be computed at the second
stage, presented in Section 8.7.2.

Let f_j = w[b_{j-1} + 1..b_j] be the current factor, ℓ_j = b_j − b_{j-1}, and let
w[b_{j-1} + 1 − Δ_j .. b_j − Δ_j] be its non-overlapping left copy (i.e. Δ_j ≥ ℓ_j).
If for some position b_{j-1} + i, 1 ≤ i < ℓ_j, the minimal square centered at
b_{j-1} + i occurs entirely inside f_j, that is MLP_w(b_{j-1} + i) ≤ min{i, ℓ_j − i}, then
MLP_w(b_{j-1} + i) = MLP_w(b_{j-1} + i − Δ_j). Note that MLP_w(b_{j-1} + i − Δ_j) has
been computed before, as the minimal square centered at b_{j-1} + i − Δ_j ends before
the beginning of f_j. Based on this observation, we retrieve, in time O(|f_j|), all
values MLP_w(b_{j-1} + i) which correspond to squares occurring entirely inside
f_j. Therefore, it remains to find those values MLP_w(b_{j-1} + i) which correspond
to minimal squares that end in f_j and extend to the left beyond the frontier
between f_j and f_{j-1}.

To do this, we use the technique of computing runs of squares from Sec- 
tion 8.4. The idea is to compute all candidate squares and test which of them 
are minimal. However, this should be done carefully as this can break down the 
linear time bound, due to a possible super-linear number of all squares. The 
main trick is to keep squares in runs and to show that there is only a linear 
number of individual squares which need to be tested for minimality. 

We are interested in squares starting at positions less than or equal to b_{j-1}
and ending inside f_j. All these squares are divided into those which are centered
inside f_j and those centered to the left of f_j. The two cases are symmetrical and
therefore we concentrate on squares centered at positions [b_{j-1}..b_{j-1} + ℓ_j − 1]. We
compute all such squares in the increasing order of periods. For each p ∈ [1..ℓ_j]
we compute the run of all squares of period p centered at positions belonging
to the interval [b_{j-1}..b_{j-1} + ℓ_j − 1], starting at a position less than or equal
to b_{j-1}, and ending inside f_j, as explained above. Assume we have computed
a run of such squares of period p, and assume that q < p is the maximal
period value for which squares have been previously found. If p ≥ 2q, then we
check each square of the run whether it is minimal or not by checking the value
MLP_w(b_{j-1} + i). If this square is not minimal, then MLP_w(b_{j-1} + i) has
already been assigned a positive value before. Indeed, if a smaller square centered at
b_{j-1} + i exists, it has necessarily been already computed by the algorithm (recall
that squares are computed in the increasing order of periods). If no positive
value MLP_w(b_{j-1} + i) has yet been set, then we have found the minimal square
centered at b_{j-1} + i. Since there are at most p considered squares of period p
(their centers belong to the interval [b_{j-1}..b_{j-1} + p − 1]), checking all of them
takes at most 2(p − q) individual checks (as q ≤ p/2 and p − q ≥ p/2).

Now assume p < 2q. Consider a square s_q = w[c_q − q + 1..c_q + q] of period
q and center c_q, which has been previously found by the algorithm (square of
period q in Figure 8.4). We now prove that we need to check for minimality
only those squares s_p of period p which have their center c_p verifying one of the
following inequalities:


    |c_p − c_q| ≤ p − q,  or        (8.7.1)
    c_p ≥ c_q + q.                  (8.7.2)


In words, c_p is located either within distance p − q from c_q, or beyond the end
of square s_q.


LEMMA 8.7.2. Let s_p = w[c_p − p + 1..c_p + p] be the minimal square centered
at some position c_p. Let s_q = w[c_q − q + 1..c_q + q] be another square with q < p.
Then one of inequations (8.7.1), (8.7.2) holds.


Proof. By contradiction, assume that neither of them holds. Consider the case
c_p > c_q; the case c_p < c_q is symmetric. The situation with c_p > c_q is shown in
Figure 8.4. Since (8.7.2) does not hold, the word w[c_q + 1..c_p] lies inside the
right root of s_q. Now observe that this word has a copy w[c_q − q + 1..c_p − q]
(shown with empty strip in Figure 8.4) and that its length is (c_p − c_q).
Furthermore, since c_p − c_q > p − q (as inequation (8.7.1) does not hold), this
copy overlaps by p − q letters with the left root of s_p. Consider this overlap
w[c_p − p + 1..c_p − q] (shadowed strip in Figure 8.4). It has a copy
w[c_p + 1..c_p + (p − q)] and another copy w[c_p − (p − q) + 1..c_p] (see Figure 8.4).
We thus have a smaller square centered at c_p, which proves that square s_p is not
minimal. ∎
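
Lemma 8.7.2 can be sanity-checked exhaustively on short words. The sketch below is illustrative only: internal_squares and lemma_holds are names introduced here, centers are counted as in the text (center c separates the first c letters from the rest), only internal squares are considered as in this section, and the mirror condition c_p ≤ c_q − q is added to cover the symmetric case c_p < c_q.

```python
def internal_squares(w):
    """All (center, period) pairs of squares occurring inside w.
    For each center the periods are listed in increasing order."""
    n = len(w)
    return [(c, p)
            for c in range(1, n)
            for p in range(1, min(c, n - c) + 1)
            if w[c - p:c] == w[c:c + p]]


def lemma_holds(w):
    """Check Lemma 8.7.2 (with its mirror case) on every pair of squares of w."""
    squares = internal_squares(w)
    minimal = {}
    for c, p in squares:
        minimal.setdefault(c, p)        # first period seen is the smallest one
    for cp, p in minimal.items():
        for cq, q in squares:
            if q < p and not (abs(cp - cq) <= p - q   # (8.7.1)
                              or cp >= cq + q         # (8.7.2), case cp > cq
                              or cp <= cq - q):       # mirror of (8.7.2)
                return False
    return True
```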


Figure 8.4. Case where neither of inequations (8.7.1), (8.7.2) holds
(subcase c_p > c_q)


By Lemma 8.7.2, we need to check for minimality only those squares s_p
which verify, with respect to s_q, one of inequations (8.7.1), (8.7.2). Note that
there are at most 2(p − q) squares s_p verifying (8.7.1), and at most p − q squares
s_p verifying (8.7.2), the latter because s_p must start before the current factor,
i.e. c_p < b_{j-1} + p. We conclude that there are at most 3(p − q) squares of period
p to check for minimality, among all squares found for period p. Summing up
the number of all individual checks results in a telescoping sum, and we obtain
that processing all squares centered in the current factor can be done in time
O(|f_j|).
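
The telescoping sum can be written out explicitly. In this sketch the symbols \pi_1 < \cdots < \pi_r (the successive periods for which a non-empty run of squares is found, with \pi_0 = 0 matching the initial value of q) are notation introduced here, and both cases above are bounded uniformly by 3(p - q) checks:

```latex
\sum_{t=1}^{r} 3\,(\pi_t - \pi_{t-1})
  \;=\; 3\,(\pi_r - \pi_0)
  \;=\; 3\,\pi_r
  \;\le\; 3\,\ell_j
  \;=\; O(|f_j|).
```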

The RIGHT-LOCAL-SQUARES algorithm for computing minimal squares centered
inside f_j is given below. It is based on algorithms RIGHT-REPETITIONS
and MAXIMAL-REPETITIONS from Section 8.4. In particular, t_j is defined as in
the MAXIMAL-REPETITIONS algorithm.

A similar algorithm applies to the squares centered on the left of f_j. Note
that after processing f_j, all minimal squares ending in f_j have been computed.

To summarize, we need to check for minimality only O(|f_{j-1}| + |f_j|) squares,
among those containing the frontier between f_j and f_{j-1}, each check taking
constant time. We also need O(|f_j|) time to compute minimal squares occurring
inside f_j. Processing f_j then takes time O(|f_{j-1}| + |f_j|) overall, and processing
the whole word takes time O(n).


THEOREM 8.7.3. All internal minimal squares in a word of length n can be 
computed in time O(n). 
RIGHT-LOCAL-SQUARES(f_j)
    ▷ computing minimal squares centered inside f_j
    LS_{t_j|f_j} ← LONGEST-SUFFIX-EXTENSION(t_j, f_j)
    LP_{f_j} ← LONGEST-PREFIX-EXTENSION(f_j)
    q ← 0
    for p ← 1 to ℓ_j do
        t_j ← suffix of f_1 ··· f_{j-1} of length (2|f_{j-1}| + max{2|f_{j-2}|, |f_j|})
        LS(p) ← min{LS_{t_j|f_j}(p), p}
        LP(p + 1) ← min{LP_{f_j}(p + 1), p}
        if LS(p) + LP(p + 1) ≥ p then
            I ← [p − LS(p) .. LP(p + 1)]    ▷ interval of square centers
            if p < 2q then
                c'_q ← c_q − b_{j-1}
                I ← I ∩ ([c'_q − (p − q) .. c'_q + (p − q)] ∪ [c'_q + q .. p − 1])
            for each i ∈ I do
                if MLP_w(b_{j-1} + i) is undefined then
                    MLP_w(b_{j-1} + i) ← p
            q ← p
            c_q ← b_{j-1} + p − LS(p)


8.7.2. Computing external minimal squares 


The algorithm of the previous section allows us to compute all internal minimal
squares of a word. Here we show how to compute external minimal squares for
those positions which do not have internal squares centered at them.

Consider a word w of length n. We first consider squares which are right- 
external but not left-external. Those squares are centered at positions in the 
right half of the word. The case of squares which are left-external but not 
right-external is symmetrical. 

For each position i in the right half of w, we compute a value RS(i) equal to
the period of the smallest right-external and not left-external square centered at
i, provided such a square exists. We show that all values RS(i) can be computed
in linear time using longest extension functions (Section 8.3.1).

Consider a right-external square of period p centered at some position i ∈
[⌈n/2⌉..n − 1], where n − i < p ≤ i. Observe that w[i − p + 1..n − p] = w[i + 1..n].
This implies that LS_w(n − p) ≥ n − i. Conversely, if for some p ∈ [1..n − 1],
LS_w(n − p) > 0, then there exists a family of squares of period p centered at
positions i ∈ [n − LS_w(n − p)..n − 1]. For i > n − p, the square is right-external,
otherwise it is internal.

This implies the following algorithm for computing minimal right-external
squares. Compute LS_w for all positions of w. For each j ∈ [1..n], set LNS_w(j) =
LS_w(j) if LS_w(j) < n − j, and LNS_w(j) = n − j − 1 otherwise. For each
center position i ∈ [⌈n/2⌉..n − 1], we need to compute the minimal p such that
LNS_w(n − p) ≥ n − i.

Consider all pairs (j, LNS_w(j)) for j ∈ [1..n]. If for some pair (j, LNS_w(j)),
there exists a pair (j', LNS_w(j')) such that LNS_w(j') ≥ LNS_w(j) and j' > j,
then (j, LNS_w(j)) carries no useful information for computing minimal
right-external squares. We then delete all such pairs (j, LNS_w(j)) from consideration
by looping through all j from n to 1 and deleting those for which the value
LNS_w(j) is smaller than or equal to max_{j' > j}{LNS_w(j')}. We then sort the
remaining pairs (j, LNS_w(j)) in the decreasing order of LNS_w(j). Using bucket
sort, this can be done in O(n) time and space.

We now set the values RS(i) as follows. For the first element (j_0, LNS_w(j_0))
of the list, we set RS(i) = 0 for all i ∈ [⌈n/2⌉..n − LNS_w(j_0) − 1]. We then
scan through the ordered list of pairs and for each element (j, LNS_w(j)), look
at the next element (j', LNS_w(j')), LNS_w(j) > LNS_w(j'). For all i ∈ [n −
LNS_w(j)..n − LNS_w(j') − 1], set RS(i) = n − j. We then have the following


LEMMA 8.7.4. For each i ∈ [⌈n/2⌉..n − 1], RS(i) is the smallest period of
a right-external square centered at i if such a square exists, and RS(i) = 0
otherwise.
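
The computation of the values RS(i) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names suffix_extensions and right_external_periods are introduced here, LS_w is computed naively in quadratic time instead of by the linear-time longest extension functions, the index j is restricted to [1..n − 1], and the last pair of the list is paired with a sentinel value 0. Scanning j from right to left yields the surviving pairs already ordered by LNS_w, so no separate bucket sort is needed in this sketch.

```python
def suffix_extensions(w):
    """LS[j] (1 <= j <= n): length of the longest common suffix of w[1..j] and w.
    Quadratic for brevity; the text obtains these values in linear time."""
    n = len(w)
    LS = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and w[j - 1 - k] == w[n - 1 - k]:
            k += 1
        LS[j] = k
    return LS


def right_external_periods(w):
    """RS[i] for i in [ceil(n/2)..n-1]; RS[i] = 0 when no such square exists."""
    n = len(w)
    LS = suffix_extensions(w)
    kept, best = [], 0
    for j in range(n - 1, 0, -1):       # right to left: keep undominated pairs only
        lns = min(LS[j], n - j - 1)     # LNS_w(j)
        if lns > best:
            kept.append((j, lns))
            best = lns
    kept.reverse()                      # now in decreasing order of LNS_w(j)
    RS = {i: 0 for i in range((n + 1) // 2, n)}
    for idx, (j, lns) in enumerate(kept):
        nxt = kept[idx + 1][1] if idx + 1 < len(kept) else 0
        for i in range(n - lns, n - nxt):
            RS[i] = n - j
    return RS
```

For w = abaab the sketch returns RS(3) = RS(4) = 3.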


We now turn to squares that are both right-external and left-external.
Consider such a square of period p, centered at some position i. Observe that
w[1..n − p] = w[p + 1..n]. Therefore, there exists a border of w of size n − p < n/2.
The largest border corresponds to the smallest square. On the other hand, the
period of this square is equal to the minimal period p(w) of w. Note that, in
general, each local period cannot be greater than p(w), and all minimal squares
which are both right-external and left-external have the period equal to p(w).

It is well known that p(w) can be easily computed in linear time (see Chap- 
ter 1). We then obtain an O(n) algorithm for computing all minimal squares: 
first, using Theorem 8.7.3 we compute all minimal internal squares; then, using 
Lemma 8.7.4 and the above remark, we compute the minimal external squares 
for those positions for which no internal square has been found at the first stage. 
This proves the main result. 


THEOREM 8.7.5. For a word of length n, all local periods MLP_w(i) can be
computed in time O(n).


8.8. Finding approximate repetitions 


In many practical applications, such as DNA sequence analysis, the repetitions
considered admit a certain variation between copies of the repeated pattern. In other
words, the repetitions of interest are approximate repetitions and not necessarily
exact repetitions only. Computing approximate repetitions is the subject of this
section.

The simplest notion of approximate repetition is an approximate square. An 
approximate square in a word is a factor uv, where u and v are within a given 
distance k and the distance measure could be one of those usually used in prac- 
tical applications, such as Hamming distance or Levenshtein (or edit) distance. 
Here we focus on the Hamming distance, when the variation between repeated
copies can be only letter replacements. An important motivation here is to
define structures encoding families of approximate squares, analogous to maximal
repetitions in the exact case. In Section 8.8.1, we define two basic structures that
we call K-repetitions and K-runs, where K is the number of allowed errors. In
Section 8.8.2, we show that all K-repetitions can be found in time O(nK log K + S),
where S is their number. In Section 8.8.3 we show that the same bound holds
for K-runs: all of them can be found in time O(nK log K + R), where R is their
number. The latter result implies, in particular, that all approximate squares
can be found in time O(nK log K + T) (T their number). All those algorithms
require only O(n) of working space.


8.8.1. K-repetitions and K-runs 


Let h(·,·) be the Hamming distance between two words of equal length, that is
h(u,v) is the number of mismatches (letter differences at corresponding posi- 
tions) between u and v. For example, h(baaacb, bcabcb) = 2. 

A word s = uv, such that |u| = |v|, is called a K-square iff h(u, v) ≤ K.
Reusing the terminology of the exact case, we call p = |u| = |v| the period of s,
and words u, v the left and right root of s respectively.

We now want to define a more global structure which would be able to 
capture “long approximate repetitions”, generalizing repetitions of arbitrary 
exponent in the exact case. As opposed to the exact case, Conditions (i)-(ii) 
of Proposition 8.1.1 generalize to different notions of approximate repetition. 
Condition (i) gives rise to the strongest of them: A word r[1..n] is called a
K-repetition of period p, p ≤ n/2, iff h(r[1..n − p], r[p + 1..n]) ≤ K.

Equivalently, a word r[1..n] is a K-repetition of period p, if the number of
i such that r[i] ≠ r[i + p] is at most K. For example, abaa abba cbba cb is a
2-repetition of period 4. abc abc abc abd abd abd abd abd is a 1-repetition of period
3 but abc abc abc abb abc abc abc abb is not.

Another point of view, expressed by Condition (ii) of Proposition 8.1.1,
considers a repetition as an encoding of squares it contains. Projecting this
to the approximate case, we come up with the notion of run of approximate
squares: A word r[1..n] is called a run of K-squares, or a K-run, of period p,
p ≤ n/2, iff for every i ∈ [1..n − 2p + 1], the factor s = r[i..i + 2p − 1] is a
K-square of period p.
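
The three definitions translate directly into code. The following sketch is only meant to make them concrete; the function names hamming, is_k_square, is_k_repetition and is_k_run are introduced here, a period p ≥ 1 is assumed, and no attempt is made at efficiency.

```python
def hamming(u, v):
    """Number of mismatching positions between two words of equal length."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(a != b for a, b in zip(u, v))


def is_k_square(s, K):
    """s = uv with |u| = |v| and h(u, v) <= K."""
    p, odd = divmod(len(s), 2)
    return odd == 0 and hamming(s[:p], s[p:]) <= K


def is_k_repetition(r, p, K):
    """At most K mismatches r[i] != r[i + p] over the whole word, with p <= n/2."""
    n = len(r)
    return 2 * p <= n and hamming(r[:n - p], r[p:]) <= K


def is_k_run(r, p, K):
    """Every factor of length 2p of r is a K-square, with p <= n/2."""
    n = len(r)
    return 2 * p <= n and all(is_k_square(r[i:i + 2 * p], K)
                              for i in range(n - 2 * p + 1))
```

On the examples of the text, abc abc abc abb abc abc abc abb (spaces removed) is rejected by is_k_repetition with p = 3 and K = 1 but accepted by is_k_run, since its three mismatches never fall into the same window of length 6.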

Similarly to the exact case, when we are looking for approximate repetitions 
occurring in a word, it is natural to consider maximal approximate repetitions. 
Those are repetitions extended to the right and left as much as possible pro- 
vided that the corresponding definition is still verified. Note that the notion of 
maximality applies both to K-repetitions and to K-runs: in both cases we can 
extend each repetition to the right and left as long as it verifies the corresponding 
definition. We will always be interested in maximal K-repetitions and K-runs, 
without mentioning it explicitly. Note that for both definitions, the maximality
requirement implies that if r = w[i..j] is an approximate repetition of period p
in w[1..n], then w[j + 1] ≠ w[j + 1 − p] (provided j < n) and w[i − 1] ≠ w[i − 1 + p]
(provided i > 1). Furthermore, if w[i..j] is a maximal K-repetition, it contains
exactly K mismatches w[ℓ] ≠ w[ℓ + p], i ≤ ℓ, ℓ + p ≤ j, unless the whole word w
contains less than K mismatches (to simplify the presentation, we exclude this
latter case from consideration).


Figure 8.5. Maximal K-repetition


Figure 8.6. Maximal K-run

Figure 8.5 illustrates the definition of (maximal) K-repetitions and Fig- 
ure 8.6 that of (maximal) K-run. 


EXAMPLE 8.8.1. The following Fibonacci word contains three 3-runs of
period 6. They are shown in regular font, in positions aligned with their
occurrences. Two of them are identical, and each contains four 3-repetitions, shown
in italic for the first run only. The third run is a 3-repetition in itself.


010010 100100 101001 010010 010100 1001


 10010 100100 101001
 10010 100100 10
  0010 100100 101
    10 100100 10100
     0 100100 101001
                1001 010010 010100 1
                         10 010100 1001

In general, each K-repetition is a factor of a K-run of the same period. On
the other hand, a K-run in a word is the union of all K-repetitions it contains.
Observe that a K-run can contain as many as a linear number of K-repetitions
with the same period. For example, the word (000 100)^n of length 6n is a 1-run
of period 3, which contains (2n − 1) 1-repetitions.

In general, the following lemma holds. 


LEMMA 8.8.2. Let w[1..n] be a K-run of period p and let s be the number
of mismatches w[i] ≠ w[i + p], 1 ≤ i, i + p ≤ n (equivalently, s = h(w[1..n − p],
w[p + 1..n])). Then w contains (s − K + 1) K-repetitions of period p.
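
The count in Lemma 8.8.2 can be observed on the example (000 100)^n with a small enumeration. The sketch below is illustrative: maximal_k_repetitions is a name introduced here, positions are 0-based and inclusive, and the word is assumed to contain at least K mismatches (the case excluded in the text). Each maximal K-repetition is obtained by taking K consecutive mismatches and extending up to, but not including, the neighbouring mismatches.

```python
def maximal_k_repetitions(w, p, K):
    """Maximal K-repetitions of period p in w, as 0-based inclusive (start, end) pairs.
    Assumes w contains at least K mismatches w[i] != w[i + p]."""
    n = len(w)
    mism = [i for i in range(n - p) if w[i] != w[i + p]]
    bounds = [-1] + mism + [n - p]      # sentinels just outside the comparable range
    reps = []
    for t in range(len(bounds) - K - 1):
        start = bounds[t] + 1                 # right after the previous mismatch
        end = bounds[t + K + 1] + p - 1       # just before the partner of the next one
        if end - start + 1 >= 2 * p:          # a K-repetition needs p <= length / 2
            reps.append((start, end))
    return reps
```

For (000 100)^3, p = 3 and K = 1 there are s = 5 mismatches and the function returns 5 = s − K + 1 intervals.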


K-runs and K-repetitions provide respectively the weakest and strongest
notions of repetitions with K mismatches, and therefore "embrace" all practically
relevant repetitions.


8.8.2. Finding K-repetitions 


In this subsection we describe how to find, in a given word w, all maximal 
K-repetitions occurring in w (K is a given constant). 

We assume we fixed a minimal bound p_0 for the period of repetitions we are
looking for. For example, p_0 can be taken to be K + 1, having in mind that if
a period p ≤ K is allowed, then any factor of length 2p would be a K-square.
This assumption, however, is pragmatic and affects neither the method nor the
complexity bounds.

From a general point of view, we are going to apply the same approach as 
the one used in Sections 8.4-8.7. However, the case of approximate repetitions 
is more complex and requires a number of modifications. 


We start with describing the modification of the basic problem of finding
repetitions containing a given position of the word (Theorem 8.4.1 of Section 8.4).
Recall that a factor w[i..j] of w is said to contain a position ℓ of w iff i ≤ ℓ ≤ j.
A factor w[i..j] is said to touch a position ℓ of w iff i − 1 ≤ ℓ ≤ j + 1. Here,
it will be convenient for us to specify the problem as follows: Given a word
w[1..n] and a distinguished position ℓ ∈ [2..n − 1], find all K-repetitions in w
that touch ℓ. Similar to Section 8.4, we distinguish two (non-disjoint) classes of
K-repetitions according to whether they have a root on the right or on the left
of position ℓ. We focus on K-repetitions of the first class; those of the second
class are found similarly.

To apply a method similar to the one of Theorem 8.4.1, we need a
generalization of longest extension functions (Section 8.3.1) that compares factors up
to a given Hamming distance. Formally, given a word w and a position ℓ, for
every k = 0, ..., K, we compute the following functions on p ∈ [p_0..n − ℓ]:


    LP^(k)(p) = max{j | h(w[ℓ + p..ℓ + p + j − 1], w[ℓ + 1..ℓ + j]) ≤ k},    (8.8.1)
    LS^(k)(p) = max{j | h(w[ℓ + p − j + 1..ℓ + p], w[ℓ − j + 1..ℓ]) ≤ k}.    (8.8.2)


LP^(k)(p) is the length of the longest factor of w starting at position ℓ + p and
equal, within k mismatches, to the factor of the same length starting at ℓ + 1.
LS^(k)(p) is the length of the longest factor ending at position ℓ + p and equal,
within k mismatches, to the factor ending at position ℓ. These functions are
generalizations of longest extension functions considered in Section 8.3.1 and
can be computed in time O(nk) using suffix trees combined with the lowest
common ancestor computation in a tree (see Gusfield 1997).

Consider now a K-repetition r of period p that has a root on the right of a
position ℓ. Note that position ℓ + p of w is contained in r, and that r is uniquely
defined by the number of mismatches w[i + p] ≠ w[i], i ≥ ℓ + 1, occurring
in r. Let k be the number of those mismatches. The following theorem is a
generalization of Theorem 8.4.1.


THEOREM 8.8.3. Let w be a word of length n and let ℓ, 1 < ℓ < n, be a
distinguished position of w. There exists a K-repetition of period p which
touches position ℓ, and has a root on the right of ℓ, iff for some k ∈ [0..K],


    LS^(K−k)(p) + LP^(k)(p + 1) ≥ p.    (8.8.3)


When (8.8.3) holds, this repetition starts at position ℓ − LS^(K−k)(p) + 1 and
ends at position ℓ + p + LP^(k)(p + 1).


Theorem 8.8.3 provides an O(nK) algorithm for finding all considered
K-repetitions: compute longest extension functions (8.8.1), (8.8.2) (this takes
time O(nK)) and then check inequation (8.8.3), for each k = 0, ..., K and all
p ∈ [p_0..ℓ − 1] (this takes time O(nK) too). Every time the inequation is verified,
a K-repetition is identified. The computation is summarized in the following
algorithm:


MISMATCH-RIGHT-REPETITIONS(w, ℓ)
    ▷ Find K-repetitions of w which have a root on the right of position ℓ
    for all k = 0, ..., K do
        ▷ compute longest extension functions (8.8.1), (8.8.2)
        LP^(k) ← MISMATCH-PREFIX-EXTENSION(w, ℓ, k)
        LS^(k) ← MISMATCH-SUFFIX-EXTENSION(w, ℓ, k)
    R ← ∅
    for p ← p_0 to min{n − ℓ + 1, n/2} do
        for k ← 0 to K do
            if LP^(k)(p + 1) + LS^(K−k)(p) ≥ p then
                r ← (ℓ − LS^(K−k)(p) + 1, ℓ + p + LP^(k)(p + 1))
                R ← R ∪ {r}
    return R


Finding repetitions having a root on the left of position ℓ is a symmetric
problem that can be solved within the same time bound (hereafter referred to
as algorithm MISMATCH-LEFT-REPETITIONS).
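
The test of Theorem 8.8.3 can be tried out with naive extension functions. The sketch below is an illustration under assumptions, not the suffix-tree based method cited above: the names lp_ext, ls_ext and mismatch_right_repetitions are introduced here, positions follow the 1-based convention of the text, each extension is computed by direct letter comparison, the period ranges over [p_0..n − ℓ] (the domain of the extension functions), and the budget of K mismatches is split as k on the right and K − k on the left. Near the ends of the word some reported factors may be non-maximal.

```python
def lp_ext(w, l, p, k):
    """LP^(k)(p): longest j with h(w[l+p..l+p+j-1], w[l+1..l+j]) <= k."""
    n, j, mism = len(w), 0, 0
    while l + p + j <= n:
        if w[l + p + j - 1] != w[l + j]:     # 1-based positions l+p+j and l+1+j
            mism += 1
            if mism > k:
                break
        j += 1
    return j


def ls_ext(w, l, p, k):
    """LS^(k)(p): longest j with h(w[l+p-j+1..l+p], w[l-j+1..l]) <= k (needs l+p <= n)."""
    j, mism = 0, 0
    while l - j >= 1:
        if w[l + p - j - 1] != w[l - j - 1]:  # 1-based positions l+p-j and l-j
            mism += 1
            if mism > k:
                break
        j += 1
    return j


def mismatch_right_repetitions(w, l, K, p0=1):
    """Triples (p, start, end), 1-based inclusive, passing test (8.8.3) at position l."""
    n = len(w)
    found = set()
    for p in range(p0, n - l + 1):
        for k in range(K + 1):
            left = ls_ext(w, l, p, K - k)
            right = lp_ext(w, l, p + 1, k)
            if left + right >= p:
                found.add((p, l - left + 1, l + p + right))
    return found
```

With w = abaaabbacbbacb, ℓ = 4 and K = 2, the result contains (4, 1, 14), that is, the whole word seen as a 2-repetition of period 4.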
We are now ready to describe an extension of the algorithm
MAXIMAL-REPETITIONS from Section 8.4 to compute all K-repetitions in a word w.
Consider the Lempel-Ziv factorization w = f_1 f_2 ··· f_m (Section 8.3.2). The last
position of an LZ-factor f_i will be called the head of f_i. The algorithm consists
of three stages. The first stage is based on the following two lemmas.


LEMMA 8.8.4. The suffix of length p of a K-repetition of period p cannot 
contain K + 1 consecutive LZ-factors. 


Proof. Each LZ-factor contained in the suffix of length p of a K-repetition
must contain at least one mismatch with the letter located p positions to the
left. Indeed, if it does not contain a mismatch, it has an exact copy occurring
earlier, which contradicts the definition of the Lempel-Ziv factorization. All
those mismatches belong to the repetition and there are at most K of them.
Therefore, the suffix of length p contains at most K LZ-factors. ∎


Divide w into consecutive blocks of (K + 2) LZ-factors. Let w = B_1 ··· B_{m'}
be the partition of w into such blocks. The last letter of block B_i will be called
the head of this block. At the first stage, we find, for each block B_i, those
K-repetitions which touch the head of B_{i-1} but do not touch that of B_i. The
following lemma is analogous to Theorem 8.4.2.


LEMMA 8.8.5. Assume that a K-repetition r touches the head of B_{i-1} but not
that of B_i. Then the length of the prefix of r which is a suffix of B_1 ··· B_{i-1} is
bounded by |B_i| + 2|B_{i-1}|.


Proof. Lemma 8.8.4 implies that the suffix of r of period length cannot start
before the first letter of B_{i-1}. Therefore, the period of r is bounded by |B_{i-1}B_i|.
On the other hand, by an argument similar to Lemma 8.8.4, r cannot extend
by more than one period to the left of B_{i-1}. This is because otherwise each
of the LZ-factors of B_{i-1}, except possibly the last one, would correspond to a
mismatch in r, and thus r would contain at least (K + 1) mismatches, which
is a contradiction. Therefore, the length of the prefix of r which is a suffix of
B_1 ··· B_{i-2} is at most |B_{i-1}| + |B_i|. ∎


Based on Lemma 8.8.5, we apply the MISMATCH-RIGHT-REPETITIONS and
MISMATCH-LEFT-REPETITIONS algorithms: Consider the word w_i = v_i B_i, where v_i
is the suffix of B_1 ··· B_{i-1} of length (2|B_{i-1}| + |B_i|). Then find, by
MISMATCH-RIGHT-REPETITIONS and MISMATCH-LEFT-REPETITIONS, all K-repetitions in
w_i touching the head of B_{i-1} and discard those which touch the head of B_i.
The resulting complexity is O(K(|B_{i-1}| + |B_i|)).

After processing all blocks, we find all repetitions touching block heads. 
Observe that repetitions resulting from processing different blocks are distinct. 
Summing up over all blocks, the resulting complexity of the first stage is O(nK). 
The repetitions which remain to be found are those which lie entirely within a 
block — this is done at the next two stages. 

At the second stage we find all K-repetitions inside each block B_i which
touch factor heads other than the block head (i.e. the head of the last LZ-factor
of the block). For each B_i, we proceed by the following divide-and-conquer
procedure:


PROCESS-BLOCK(B)
    ▷ Compute K-repetitions which touch factor heads inside a block B
    R ← ∅
    divide B = f_i f_{i+1} ··· f_{i+s} into two sub-blocks
        B' = f_i ··· f_{i+⌊s/2⌋} and B'' = f_{i+⌊s/2⌋+1} ··· f_{i+s}
    ▷ compute K-repetitions which touch the head of B'
    h ← position of the head of B'
    R_1 ← MISMATCH-RIGHT-REPETITIONS(B, h)
    R_2 ← MISMATCH-LEFT-REPETITIONS(B, h)
    R ← R ∪ R_1 ∪ R_2
    ▷ process recursively B' and B''
    R' ← PROCESS-BLOCK(B')
    R'' ← PROCESS-BLOCK(B'')
    R ← R ∪ R' ∪ R''
    return R


The algorithm PROCESS-BLOCK has ⌈log_2 K⌉ levels of recursion, and since
at each step the word is split into disjoint sub-blocks, the whole complexity of
the second stage is O(nK log K).
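
The bound can be traced through a short recurrence. This is a sketch in which T(B), the cost of PROCESS-BLOCK on a block B, and the constant c are notation introduced here; each pair of calls to the MISMATCH routines on B is charged cK|B|, and a block reduced to a single LZ-factor costs nothing since it has no inner factor head:

```latex
\begin{align*}
T(B) &\le T(B') + T(B'') + c\,K\,|B|, \qquad |B'| + |B''| = |B|,\\
T(B_i) &\le c\,K\,|B_i| \cdot (\text{number of levels}) = O(K\,|B_i| \log K),\\
\sum_{i=1}^{m'} T(B_i) &= O\Bigl(K \log K \sum_{i=1}^{m'} |B_i|\Bigr) = O(nK \log K).
\end{align*}
```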

Finally, at the third stage, it remains to compute the K-repetitions which 
occur entirely inside each Lempel-Ziv factor, i.e. don’t contain its first position 
and don’t touch its head. By definition of Lempel-Ziv factorization, each LZ- 
factor without its head has a (possibly overlapping) copy on the left. Therefore, 
each of these K-repetitions has another occurrence in that copy. Using this 
observation, these K-repetitions can be found using the same technique as at 
the second stage of MAXIMAL-REPETITIONS: during the construction of the 
Lempel-Ziv factorization we keep, for each LZ-factor wa, a pointer to a copy 
of w on the left. Then process all LZ-factors from left to right and recover 
repetitions occurring inside each LZ-factor from its left copy in the same way 
that it was done at the second stage of MAXIMAL-REPETITIONS. The complexity 
of this stage is O(n + S), where S is the number of repetitions found.

The following theorem summarizes this section. 


THEOREM 8.8.6. All K-repetitions in a word of length n can be found in time
O(nK log K + S), where S is the number of K-repetitions found.


The algorithm K-REPETITIONS given below summarizes the three stages of
the computation of K-repetitions.

K-REPETITIONS(w)
 1  (f_1, ..., f_m) ← Lempel-Ziv factorization of w
 2  partition f_1, ..., f_m into blocks B_1, ..., B_{m′} of K + 2 consecutive factors
 3  ▷ first stage
 4  R ← ∅
 5  for i ← 2 to m′ do
 6      v_i ← suffix of B_1 ··· B_{i−1} of length (2|B_{i−1}| + |B_i|)
 7      ℓ ← |v_i|
 8      R′_i ← MISMATCH-RIGHT-REPETITIONS(v_i B_i, ℓ)
 9      R″_i ← MISMATCH-LEFT-REPETITIONS(v_i B_i, ℓ)
10      R ← R ∪ R′_i ∪ R″_i
11  ▷ second stage
12  for i ← 1 to m′ do
13      R_i ← PROCESS-BLOCK(B_i)
14      R ← R ∪ R_i
15  ▷ third stage
16  for each factor f_i do
17      retrieve all K-repetitions which occur entirely inside f_i using a
        procedure similar to the second stage of MAXIMAL-REPETITIONS
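Lines 1-2 of K-REPETITIONS can be sketched as follows. The sketch assumes that each LZ-factor is the longest prefix of the remaining suffix that has an earlier occurrence, followed by one more letter (its head); this is inferred from the remark that each factor without its head has a copy on the left. The quadratic search is for illustration only and is not the linear-time construction; the names `lz_factorization` and `blocks` are hypothetical.

```python
def lz_factorization(w, overlapping=True):
    """Naive Lempel-Ziv factorization of w into factors f = v + head."""
    n, i, factors = len(w), 0, []
    while i < n:
        best = 0
        for j in range(i):                       # start of a candidate copy
            l = 0
            while (i + l < n
                   and (overlapping or j + l < i)  # copy must end before i
                   and w[j + l] == w[i + l]):
                l += 1
            best = max(best, l)
        end = min(i + best + 1, n)               # copied part plus the head
        factors.append(w[i:end])
        i = end
    return factors

def blocks(factors, K):
    """Group factors into blocks of K + 2 consecutive factors."""
    size = K + 2
    return [factors[t:t + size] for t in range(0, len(factors), size)]
```

For instance, `lz_factorization("aaaa")` lets the copy overlap the factor, while `overlapping=False` gives the variant with non-overlapping copies; the last factor may lack a head when the word ends.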


8.8.3. Finding K-runs 


We now describe an algorithm for finding all K-runs in a word. The general 
structure of this algorithm is the same as algorithm K-REPETITIONS — it has 
three stages playing similar roles. However, the case of K-runs will require a 
considerable modification and additional algorithmic techniques, especially at 
the third stage. 

At the first and second stages, the key difference is the type of objects we are 
looking for: instead of computing K-repetitions we now compute subruns of K- 
squares. Formally, a K-subrun is a family of K-squares occurring at successive 
positions. In other words, a K-subrun is a K-run which is not necessarily 
maximal. In this section, we identify a subrun with the interval of end positions 
of the squares it contains (note that a different convention has been adopted in 
Section 8.7.1 where we identified a subrun with the interval of central positions 
of its squares). 
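To make the convention concrete, here is a brute-force Python sketch that lists, for a fixed period p, the maximal intervals of end positions of K-squares. It assumes that a K-square of period p is a factor uv with |u| = |v| = p and Hamming distance at most K between u and v (the definition from earlier in the section, not restated here). Positions are 1-based, and the function names are hypothetical.

```python
def k_square_ends(w, p, K):
    """End positions (1-based) of the K-squares of period p in w."""
    ends = []
    for e in range(2 * p, len(w) + 1):
        start = e - 2 * p                      # 0-based start of the square
        mismatches = sum(w[start + t] != w[start + p + t] for t in range(p))
        if mismatches <= K:
            ends.append(e)
    return ends

def k_runs(w, p, K):
    """Maximal intervals of successive end positions of K-squares."""
    runs = []
    for e in k_square_ends(w, p, K):
        if runs and runs[-1][1] == e - 1:      # extends the current interval
            runs[-1][1] = e
        else:
            runs.append([e, e])
    return [tuple(r) for r in runs]
```

Any sub-interval of an interval returned by `k_runs` is a K-subrun in the sense above. The quadratic cost per period is of course far from the bounds of this section; the sketch is only meant as a reference point for testing.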

At each point of the first and the second stages when we search for repetitions
touching some head position $\ell$, we now compute subruns containing those
K-squares which touch $\ell$, i.e. K-subruns belonging to the interval
$[\ell - 1 .. \ell + 2p]$. As in the case of K-repetitions, we split those squares
into those having a root on the left of $\ell$ (belonging to the interval
$[\ell - 1 .. \ell + p - 1]$) and those having a root on the right of $\ell$
(belonging to the interval $[\ell + p .. \ell + 2p]$). Note that here, however,
these two cases are disjoint. The modification of the
MISMATCH-RIGHT-REPETITIONS algorithm is the algorithm MISMATCH-RIGHT-SUBRUNS
below. It computes the subruns of K-squares touching $w[\ell]$ and having a root
on the right of it.

MISMATCH-RIGHT-SUBRUNS(w, ℓ)
 1  ▷ Find subruns of K-squares touching position ℓ in w
    ▷ and having the right root on the right of ℓ
 2  for all k = 0, ..., K do
 3      ▷ compute longest extension functions defined by (8.8.1), (8.8.2)
 4      LP^{(k)} ← MISMATCH-PREFIX-EXTENSION(w, ℓ, k)
 5      LS^{(k)} ← MISMATCH-SUFFIX-EXTENSION(w, ℓ, k)
 6  L ← empty list
 7  for p ← p_0 to min{n − ℓ + 1, n/2} do
 8      for k ← 0 to K do
 9          if LP^{(k)}(p + 1) + LS^{(K−k)}(p) ≥ p then
10              lbound(p, k) ← max{ℓ + 2p − LS^{(K−k)}(p), ℓ + p}
11              rbound(p, k) ← min{ℓ + p + LP^{(k)}(p + 1), ℓ + 2p}
12              r ← subrun (lbound(p, k), rbound(p, k)) of period p
13              if rbound(p, k − 1) is defined and
                    lbound(p, k) ≤ rbound(p, k − 1) + 1 then
14                  merge r with the subrun computed for k − 1
15              else add r to L
16          else rbound(p, k) ← undefined
17  return L


A major additional difficulty in computing K-runs is that we have to assem- 
ble them from subruns. To perform the assembling, we need to store subruns in 
an additional data structure that allows us to maintain links between subruns and
to merge adjacent subruns into bigger runs. Finally, we have to ensure that the 
number of subruns we come up with and the work spent on processing them do 
not increase the resulting complexity bound. 

The assembling occurs already in the function MISMATCH-RIGHT-SUBRUNS, 
as intervals (subruns) found for different values of k (for-loop at line 8) may 
overlap or immediately follow each other, in which case we join them into a 
bigger subrun (lines 13-14). Moreover, those intervals can be disjoint, in which 
case we organize them in a linked list. A more formal description of the data 
structure will be given later. 
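The joining rule of lines 13-15 can be sketched in Python as follows. The sketch assumes that, for a fixed period, the intervals (lbound, rbound) arrive in increasing order of position as k grows, and it uses a plain Python list in place of the linked list; the function name `assemble` is hypothetical.

```python
def assemble(intervals):
    """Merge intervals that overlap or immediately follow each other.

    `intervals` holds (lbound, rbound) pairs for one period, assumed to be
    given in increasing order of position.
    """
    merged = []
    for lb, rb in intervals:
        if merged and lb <= merged[-1][1] + 1:   # overlapping or adjacent
            merged[-1][1] = max(merged[-1][1], rb)
        else:                                    # disjoint: start a new subrun
            merged.append([lb, rb])
    return [tuple(iv) for iv in merged]
```

The test `lb <= merged[-1][1] + 1` is the condition of line 13: a new interval is merged when it starts no later than one position after the end of the previous one.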

The list of subruns of K-squares that touch $w[\ell]$ and have a root on the left
of it is computed similarly. Once computed, it has to be concatenated with 
the list computed by MISMATCH-RIGHT-SUBRUNS and the rightmost subrun of 
left-rooted squares has to be merged with the leftmost subrun of right-rooted 
squares, if those subruns are adjacent. 

We now describe the three stages of the algorithm in more detail. Given
an input word w, we compute the Lempel-Ziv factorization $w = f_1 \cdots f_m$.
Unlike the case of K-repetitions, we use here the Lempel-Ziv factorization with
non-overlapping copies (see Section 8.3.2). We then divide the factorization
into blocks $B_1, \ldots, B_{m'}$, each containing (K + 2) consecutive LZ-factors. At
the first stage, we compute subruns of those K-squares which touch the
heads of all blocks $B_1, \ldots, B_{m'}$. For each block $B_i$, we find the subruns of
K-squares which touch the head of $B_i$ but not that of $B_{i+1}$. This is done
using algorithm MISMATCH-RIGHT-SUBRUNS and its counterpart for the
left-rooted squares, together with Lemma 8.8.5. Let $b_i$ be the head position of $B_i$.
Then for each period p, K-subruns of period p found at this step belong to
the interval $[b_i - 1 .. \min\{b_i + 2p, b_{i+1} - 2\}]$. We call this interval the explored
interval for $b_i$ and p. For each p, K-subruns found at this step can be seen
as non-intersecting subintervals of this explored interval. These K-subruns are
stored into a doubly-linked list, say L(i, p), in increasing order of positions.
(We leave it to the reader to check that such a list can be easily computed by
the algorithm MISMATCH-RIGHT-SUBRUNS by making at each step a constant
amount of extra work.) For $p \ge (b_i - b_{i-1})/2 - 1$, the explored interval for $b_{i-1}$
has to be joined with the explored interval for $b_i$, thus forming a bigger explored
interval. Accordingly, lists L(i−1, p) and L(i, p) are joined. Note that if the last
subrun of L(i−1, p) turns out to be adjacent to the first subrun of L(i, p), then
those two subruns are merged into a single one. All additional operations take
constant time, and the resulting complexity of the first stage is still O(nK).
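As an illustration of explored intervals at the first stage, the following Python sketch computes them for a given period and joins those that are adjacent. Only the interval formula comes from the text; the function name `explored_intervals`, the list `heads` of block head positions, and the cap of the last interval at the word length n are assumptions.

```python
def explored_intervals(heads, p, n):
    """Explored intervals [b - 1 .. min(b + 2p, b_next - 2)], joined when adjacent.

    `heads` lists block head positions in increasing order; `n` is the word
    length, used (by assumption) to cap the interval of the last head.
    """
    joined = []
    for idx, b in enumerate(heads):
        b_next = heads[idx + 1] if idx + 1 < len(heads) else n + 2
        lo, hi = b - 1, min(b + 2 * p, b_next - 2)
        if joined and joined[-1][1] + 1 >= lo:   # no gap: join with previous
            joined[-1][1] = hi
        else:
            joined.append([lo, hi])
    return [tuple(iv) for iv in joined]
```

With heads at positions 10 and 20, the two intervals are joined exactly when 10 + 2p reaches 18, that is when p is at least (20 - 10)/2 - 1 = 4.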

The second stage is modified in a similar way. Let $b_i$ denote the head
position of factor $f_i$. Recall that at each call of PROCESS-BLOCK we are searching
for K-squares occurring between some factor head $b_{j'}$ and another factor head
$b_{j''}$, and touching some factor head $b_i$ ($j' < i < j''$). Moreover, no factor head
between $b_{j'}$ and $b_{j''}$ has been processed yet. In this case, the explored interval
is $[\max\{b_{j'} + 2p + 1, b_i - 1\} .. \min\{b_i + 2p, b_{j''} - 2\}]$, and we may have to merge
it either with the previous explored interval, or with the next one, or both.

After the first and the second stages, we have computed lists of subruns of
K-squares that touch all factor heads. Each list stores subruns of K-squares
of some fixed period p touching heads of some successive factors $f_i, f_{i+1}, \ldots, f_j$.
Note that for each factor and each period, the corresponding list exists but can
be empty. Each such list is accessed through two pointers, associated with the
corresponding leftmost ($f_i$) and rightmost ($f_j$) factors. We denote these pointers
$\mathrm{left}_i(p)$ and $\mathrm{right}_j(p)$ respectively. These pointers are needed, in particular, for
merging explored intervals at the second stage. An important remark is that
at each moment there are only O(n) pointers that need to be stored. The key
observation is that for each factor head $b_i$, pointer $\mathrm{left}_i(p)$ should be defined
only for periods $p < (b_i - b_{j'})/2 - 1$, where $b_{j'}$ is the closest head on the left
of $b_i$ that has been processed before. Similarly, pointer $\mathrm{right}_i(p)$ is defined only
for periods $p < (b_{j''} - b_i)/2 - 1$, where $b_{j''}$ is the closest head to the right of $b_i$
that has been processed before. On the other hand, all pointer manipulations
add only a constant amount of work to each step of the second stage, and then
the time complexity of the second stage stays O(nK log K).

At the third stage, we have to find those K-subruns which lie entirely 
inside LZ-factors. For each period, potential occurrences of these K-subruns 
correspond precisely to the gaps between explored intervals. Thus, the third 
stage can be also seen as closing up, for each period, the gaps between explored 
intervals. The goal of the third stage is to construct, for each period p, a single 
list L[p] of all K-subruns of period p occurring in the word. In the beginning of
the third stage, L[p] is initialized to $\mathrm{left}_1(p)$.

As before, the key observation here is the fact that each LZ-factor without 
its head has a copy on the left (here required to be non-overlapping), and the 
idea is again to process w from left to right and to retrieve the K-subruns 
occurring inside each LZ-factor from its copy. However, the situation here is 
different in comparison to the previous section. One difference is that here 
subruns have to be “copied forward” when we process a factor copy, rather than 
to be retrieved at the time of processing the factor itself, as done at the second 
stage of MAXIMAL-REPETITIONS. Moreover, we may have to “cut out”, from a 
longer list, a sublist of K-subruns belonging to a factor copy and then to “fit” 
it into the gap between two explored intervals. The “cutting out” may entail 
splitting K-subruns which span over the borders of the factor copy, and “fitting 
into” may entail merging those K-subruns with K-subruns from the neighboring
explored intervals. Below we describe the algorithm for the third stage, which 
copes with these difficulties. The algorithm RUNS-THIRD-STAGE given below 
provides a detailed description of the third stage. 

During the computation of the Lempel-Ziv factorization, for each LZ-factor
$f_i = va$ we choose a copy of v occurring earlier and point from the end position
of this copy to the head position of $f_i$. It may happen that one position has
to have several pointers, in which case we organize them in a list. We traverse
w from left to right and maintain the rightmost K-run, of each period, which
starts before the current position. This K-run is called the active run and is
denoted A[p] in the RUNS-THIRD-STAGE algorithm. To this purpose, we also
maintain the following invariant: at the moment we arrive at a position i, we
have the list, denoted S[i], of all K-subruns which start at this position. The
lists S[i] are maintained according to the following general rule: for each
K-subrun starting at the current position, we register the next K-subrun of its
list at the start position of that next K-subrun, provided that this K-subrun
exists (instructions 16-19 of RUNS-THIRD-STAGE). If the next subrun does not
exist, we set a special flag islast in order to do it later.

When we arrive at the end position of a copy of an LZ-factor, we have to
copy “into the factor” all the K-subruns which this copy contains. Therefore,
we scan backwards the K-subruns contained in the copy and copy them into
the factor (instructions 24-29). After copying these K-subruns, we have bridged two
explored intervals into one interval, and linked together the two corresponding
lists of K-subruns, possibly inserting a new list of K-runs in between (line 30).
Copying K-subruns in the backward direction is important for the correctness
of the algorithm: this guarantees that no K-subruns are missed. It is also for
this reason that we need the copy to be non-overlapping with the factor.
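The “cutting out” and copying step can be pictured with the simplified Python sketch below. It works on a plain sorted list of subruns of one period, each given as an interval of end positions, and ignores the linked lists, the active run and the merging with neighboring explored intervals. The name `copy_forward` and its parameters are hypothetical: the copy is assumed to occupy positions a..b, and `shift` is the distance from the copy to the factor.

```python
def copy_forward(subruns, p, a, b, shift):
    """Copy the K-squares of period p lying inside w[a..b] to the factor.

    A square ending at e occupies [e - 2p + 1 .. e], so it lies inside the
    copy exactly when a + 2p - 1 <= e <= b.
    """
    first_end = a + 2 * p - 1
    copied = []
    for lb, rb in reversed(subruns):           # scan backwards, as in the text
        if rb < first_end:                     # all earlier subruns lie outside
            break
        lo, hi = max(lb, first_end), min(rb, b)   # cut at the copy borders
        if lo <= hi:
            copied.append((lo + shift, hi + shift))
    copied.reverse()
    return copied
```

Subruns that span a border of the copy are truncated, which corresponds to the splitting mentioned above; the translated intervals would then be fitted into the gap between two explored intervals.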

The final part of the algorithm (lines 31-40) treats the situation when, before
executing line 30, $\mathrm{right}_{s-1}(p)$ actually refers to the current list L[p]. If, in
addition, L[p] was empty before but became non-empty after the execution of
line 30, we have to add the first subrun of L[p] to the corresponding list S[i′]
(lines 31-35). If the active run A[p] was the last run in L[p] before the execution
of line 30 but is not the last one after this execution, we have to update the list
S[i′] for the start position i′ of the next subrun (instructions 36-40).

RUNS-THIRD-STAGE()
 1  ▷ A[p] is maintained to be the last considered run of period p
 2  ▷ S[i] is maintained to be the list of runs starting at i
 3  for each period p do
 4      L[p] ← left_1(p)
 5      islast[p] ← false
 6      if L[p] is not empty then
 7          r ← first run of L[p]
 8          i ← start position of r
 9          add r to S[i]
10          isempty[p] ← false
11      else isempty[p] ← true
12  for each position i ∈ [1..n] do
13      for each run r ∈ S[i] do
14          p ← period of r
15          A[p] ← r
16          if r is not the last run in L[p] then
17              r′ ← run next to r in L[p]
18              i′ ← first position of r′
19              add r′ to S[i′]
20          else islast[p] ← true
21      for each factor copy v ending at position i do
22          s ← index of the factor corresponding to v
23          for each period p ≤ |v|/2 do
24              r ← A[p]
25              while r is defined and r contains K-squares inside v do
26                  r′ ← subrun of all these K-squares
27                  r″ ← copy of r′ in f_s
28                  add/merge r″ to/with the head of list left_s(p)
29                  r ← predecessor of r in L[p]
30              link/merge right_{s−1}(p) to/with left_s(p)
31              if isempty[p] and L[p] is no more empty then
32                  r′ ← first run of L[p]
33                  i′ ← start position of r′
34                  add r′ to S[i′]
35                  isempty[p] ← false
36              if islast[p] and A[p] is no more the last run in L[p] then
37                  r′ ← run next to A[p] in L[p]
38                  i′ ← start position of r′
39                  add r′ to S[i′]
40                  islast[p] ← false


After the whole word has been traversed, no gaps between explored
intervals remain. This means that for each period p, L[p] is the list of all
K-subruns of period p occurring in the word, which are actually the searched
runs.

The complexity of the third stage is O(n + S), where S is the number of
resulting K-runs. We show this by an amortized analysis of the
RUNS-THIRD-STAGE algorithm. Specifically, we show that the total number of iterations of
each loop in RUNS-THIRD-STAGE is either O(n) or O(S). Each iteration of the
for-loop at line 13 processes a new K-run starting at position i. Therefore there
are O(S) iterations of this loop during the whole execution. Each iteration of the
for-loop at line 21 treats a copy of a distinct Lempel-Ziv factor. Furthermore,
the number of iterations of the nested for-loop at line 23 is half of the
length of the corresponding factor. Therefore, the total number of iterations of
both for-loops is O(n). On the other hand, the while-loop at line 25 iterates
O(n + S) times, as at each iteration, except possibly the first and the last one,
it computes a new K-subrun, which becomes a completed K-run at that point.
Thus, the overall time spent by all internal loops is O(n + S). The main for-loop
(line 3) makes obviously O(n) iterations, and this completes the proof that the
whole complexity of the third stage is indeed O(n + S).

Putting together the three stages, we obtain the main result of this section. 


THEOREM 8.8.7. All K-runs can be found in time O(nK log K + S) where n
is the word length and S is the number of K-runs found.


Once all K-runs have been found, we can easily output all K-squares. We 
then have the following result. 


COROLLARY 8.8.8. All K-squares can be found in time O(nK log K +S) where 
n is the word length and S is the number of K-squares found. 


8.9. Notes 


Section 8.0. We refer to Storer 1988 for applications of repetitions to compres- 
sion techniques. Galil and Seiferas 1983, Crochemore and Rytter 1995, Cole 
and Hariharan 1998 illustrate how repetitions are used in pattern matching. 
Kolpakov et al. 2003 discusses the role of repetitions in DNA sequences. More 
on biological origin and function of repeated sequences in genome sequences can 
be learnt, e.g., in Brown 1999. 


Section 8.1. Basic definitions and results of word combinatorics, including
Proposition 8.1.1 and Theorem 8.1.2 and Proposition 8.1.5, can be found in
Lothaire 1997. The exponent is called the order in Chapter 8 of Lothaire 2002.
Maximal repetitions were called maximal periodicities in Main 1989 and runs in
Iliopoulos et al. 1997.


Section 8.2. The proof of Lemma 8.2.2 is attributed to D. Hickerson and was 
communicated to us by M. Crochemore and D. Gusfield. The lemma is a weaker 
and easier-to-prove version of a result from Crochemore and Rytter 1995
asserting that under conditions of Lemma 8.2.2, the stronger inequality |y| + |z| < |x|
holds. The latter implies that the number of primitively-rooted squares
occurring as prefixes of a word w is less than $\log_\varphi |w|$ ($\varphi$ is the golden ratio), which
is a better bound than the logarithmic bound implied by Lemma 8.2.2 (cf. Theorem 8.2.1).
Moreover, the bound $\log_\varphi |w|$ is asymptotically tight, as realized by Fibonacci
words.

Lemma 8.2.3(i) is a “folklore result” (see e.g. Pirillo 1997), together with
the fact that the common prefix of $f_{n-1}f_{n-2}$ and $f_{n-2}f_{n-1}$ of length $F_n - 2$ is a
palindrome. Lemma 8.2.3(ii) appeared in De Luca 1981. The proof given here
was communicated to us by J. Berstel. Theorem 8.2.4 was proved in Crochemore 
1981. 
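The palindrome property mentioned above is easy to check experimentally. The Python sketch below assumes the convention $f_0 = b$, $f_1 = a$, $f_n = f_{n-1}f_{n-2}$ with $F_n = |f_n|$, which may differ from the indexing used in Section 8.2; the property checked does not depend on that choice as long as $F_n$ denotes the length of $f_n$. The helper names are hypothetical.

```python
def fib_word(n):
    """Fibonacci word f_n, assuming f_0 = 'b', f_1 = 'a', f_n = f_{n-1} f_{n-2}."""
    prev, cur = "b", "a"
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, cur = cur, cur + prev
    return cur

def common_prefix(x, y):
    """Longest common prefix of the words x and y."""
    k = 0
    while k < min(len(x), len(y)) and x[k] == y[k]:
        k += 1
    return x[:k]

def check(n):
    """True if the common prefix has length F_n - 2 and is a palindrome."""
    u = common_prefix(fib_word(n - 1) + fib_word(n - 2),
                      fib_word(n - 2) + fib_word(n - 1))
    return len(u) == len(fib_word(n)) - 2 and u == u[::-1]
```

For example, with n = 5 the two products are abaababa and abaabaab, whose common prefix abaaba has length 8 - 2 = 6 and is a palindrome.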

Lemma 8.2.3(iii) was proved in the PhD thesis of P. Séébold (1985). Lemma 
8.2.3(iv) appeared in Karhumaki 1983. Later, repetitions in Fibonacci words 
have been extensively studied in Mignosi and Pirillo 1992, Pirillo 1997 where it 
was proved, in particular, that they contain no repetition of exponent greater
than $2+\varphi = 3.618\ldots$ but do contain repetitions of exponent greater than $2+\varphi-\epsilon$
for every $\epsilon > 0$.

An exact formula for the number of squares in the Fibonacci word $f_n$ was obtained
in Fraenkel and Simpson 1999. It implies that this number is asymptotically
$\frac{2}{5}(3-\varphi)\, n F_n + O(F_n) \approx 0.7962 \cdot F_n \log_2 F_n + O(F_n)$.

It is interesting to note that if we count distinct squares rather than square 
occurrences, their maximal number is asymptotically linearly bounded on the 
word length. In Fraenkel and Simpson 1999, it has been shown that Fibonacci 
word $f_n$ contains $2(F_{n-2} - 1) = 2(2-\varphi)F_n + O(1)$ distinct squares. In Fraenkel
and Simpson 1998, it has been proved that the number of distinct squares in 
general words of length n is bounded by 2n (for an arbitrary alphabet). It is 
conjectured that this number is actually smaller than n. Thus, in contrast to 
square occurrences, the maximal number of distinct squares is linear. 

A linear bound on the number of maximal repetitions in Fibonacci words was 
first obtained in Iliopoulos et al. 1997 by presenting a linear-time algorithm enu- 
merating all of those. The direct formula given here was obtained in Kolpakov 
and Kucherov 2000b. Since Fibonacci words don’t contain exponents greater
than $(2+\varphi)$, Theorem 8.2.5 implies that the sum of exponents of all maximal
repetitions in $f_n$ is no greater than
$(2+\varphi)(2F_{n-2}-3) = 2(2-\varphi)(2+\varphi)F_n + O(1) = 2(3-\varphi)F_n + O(1) \approx 2.764 \cdot F_n$.
While an exact formula for the sum of exponents
is not known, a more precise estimate was obtained in Kolpakov and Kucherov
2000b, where it was shown that the sum of exponents of all maximal repetitions
in Fibonacci word $f_n$ is $C \cdot F_n + o(F_n)$, where 1.922 < C < 1.926.

A complete proof of Theorem 8.2.7 can be found in Kolpakov and Kucherov 
2000b. 

On a different but related topic, several studies have been done on the min- 
imal, rather than maximal, number of repetitions in words. In particular, it is 
well-known (Lothaire 1997) that an arbitrarily long word with no squares (and
therefore no repetitions at all) can be constructed on a three-letter alphabet.
In Fraenkel and Simpson 1995 it was shown that for the binary alphabet, there
exists an infinite word containing only three distinct squares (e.g. 00, 11, 0101) and
three is the minimal bound. Complementarily, in Kucherov et al. 2003 it was
shown that the minimal number of square occurrences in an infinite binary word
is, in the limit, a constant fraction of the word length, and that this constant is
0.55080...


Section 8.3. Longest extension functions have been introduced in Main and
Lorentz 1984; a closely related idea was used independently in Crochemore 
1983. The algorithm presented here for computing longest extension functions 
is a refinement of the algorithm of Main and Lorentz 1984. 

Using s-factorization (called f-factorization in Crochemore and Rytter 1994) 
for finding repetitions was first proposed in Crochemore 1983. The Lempel-Ziv 
factorization is directly related to the Lempel-Ziv compression algorithm (Ziv 
and Lempel 1977) and to the underlying definition of complexity of a string 
(Lempel and Ziv 1976). A discussion on two types of factorization can be found 
in Gusfield 1997. The linear-time computation of Lempel-Ziv factorization was 
used in Rodeh et al. 1981. Linear-time construction algorithms for the suffix 
tree are described in McCreight 1976, Ukkonen 1995 and for the DAWG in 
Blumer et al. 1985, Crochemore 1986. 


Section 8.4. First papers on finding repetitions in words are Crochemore 1981, 
Slisenko 1983. Crochemore 1981 proposed an O(n log n) algorithm for finding all 
occurrences of non-extensible primitively-rooted integer powers in a word. Using 
a suffix tree technique, Apostolico and Preparata 1983 described an O(n log n) 
algorithm for finding all right-maximal repetitions, which are repetitions that
cannot be extended to the right without increasing the period. Main and Lorentz 
1984 proposed another algorithm that finds all maximal repetitions in O(n log n) 
time. They also pointed out the optimality of this bound under the assumption 
of unbounded alphabet and under the restriction that the algorithm is based 
only on letter comparisons. 

Crochemore 1983 described a simple and elegant linear-time algorithm for 
finding a square in a word (and thus checking if a word is repetition-free). An- 
other linear algorithm checking whether a word contains a square was proposed 
in Main and Lorentz 1985. 

Using s-factorization, Main 1989 proposed a linear-time algorithm which 
finds all leftmost occurrences of distinct maximal repetitions in a word. This 
algorithm basically corresponds to the first stage of MAXIMAL-REPETITIONS. 
In particular, Theorems 8.4.2 and 8.4.1 are from Main 1989. The linear-time 
algorithm presented here is from Kolpakov and Kucherov 1999. 

As far as other related works are concerned, Kosaraju 1994 described an 
O(n) algorithm which, given a word, finds for each position the shortest square 
starting at this position. Stoye and Gusfield 1998 proposed several algorithms 
that are based on a unified suffix tree framework. Their results are based on an 
algorithm which finds in time O(nlogn) all branching tandem repeats. 


Sections 8.5, 8.6. The results of those sections are from Kolpakov and Kucherov
2000a. A more general problem has been considered in Brodal et al. 2000:
find all gapped repeats with a gap size belonging to a specified interval. The
proposed algorithm has the time complexity O(n log n + S), where S is the size
of the output.


Section 8.7. We refer to Duval 1998, Duval et al. 2001 for studies of properties 
of local periods. The Critical Factorization Theorem is presented in Lothaire 
1997, Choffrut and Karhumaki 1997, Lothaire 2002. For recent developments 
of the Critical Factorization Theorem see Mignosi et al. 1995. 

The results of Section 8.7 are based on Duval et al. 2003. 


Section 8.8. The problem of finding approximate squares for both Hamming and 
edit distances has been first studied in Landau and Schmidt 1993. Computing 
generalized longest extension functions in time O(nK) can be done by a method
based on the suffix tree and the computation of the nearest common ancestor 
described in Gusfield 1997. 

The results presented in the section are from Kolpakov and Kucherov 2003. 

Algorithms of Sections 8.4 and 8.8 have been implemented in the mreps 
software for finding tandem repeats in DNA sequences. For more information 
about the software and experimental results obtained with it, we refer to the 
Web-site http://www.loria.fr/mreps/ and the publication Kolpakov et al.
2003.


9.0. Introduction


This chapter illustrates the use of words to derive enumeration results and 
algorithms for sampling and coding. 
Given a family $\mathcal{C}$ of combinatorial structures, endowed with a size such that
the subset $\mathcal{C}_n$ of objects of size n is finite, we consider three problems:
- Counting: determine, for all $n \ge 0$, the cardinality $\mathrm{Card}(\mathcal{C}_n)$ of the set $\mathcal{C}_n$ of
objects with size n.
- Sampling: design an algorithm RANDC that, for any n, produces a random
object uniformly chosen in $\mathcal{C}_n$: in other terms, the algorithm must satisfy,
for any object $O \in \mathcal{C}_n$, $P(\mathrm{RANDC}(n) = O) = 1/\mathrm{Card}(\mathcal{C}_n)$.
- Optimal coding: construct a function $\varphi$ that maps injectively objects of $\mathcal{C}$
on words of $\{0,1\}^*$ in such a way that an object O of size n is coded by
a word $\varphi(O)$ of length roughly bounded by $\log_2 \mathrm{Card}(\mathcal{C}_n)$.
These three problems have in common an enumerative flavour, in the sense 
that they are immediately solved if a list of all objects of size n is available. 
However, since in general there is an exponential number of objects of size n 
in the families we are interested in, this solution is by no means satisfactory. For
a wide class of so-called decomposable combinatorial structures, including non 
ambiguous algebraic languages, algorithms with polynomial complexity can be 
derived from the rather systematic recursive method. Our aim is to explore 
classes of structures for which an even tighter link exists between counting, 
sampling and coding. 

For a number of natural families of combinatorial structures, the counting 
problem has indeed a “nice” solution: by nice we may mean that there
is a simple formula for $\mathrm{Card}(\mathcal{C}_n)$, that the generating series $\sum_{n \ge 0} \mathrm{Card}(\mathcal{C}_n) x^n$
is an algebraic function, etc. The rationale of this chapter is that these nice
enumerative properties are the visible “traces” of deeper structural properties,
and that making the latter explicit is a way to solve simultaneously and simply
the three problems above. 

The enumeration of walks on lattices (Section 9.1) is an inexhaustible
source of nice counting formulas. These formulas can often be given simple 
interpretations by viewing walks as words on an alphabet of steps, and using 
ingredients of the combinatorics of words. In particular we shall consider some 
rational and algebraic languages, shuffles and the cycle lemma. 

Convex or directed polyominoes (Section 9.2) illustrate the idea that nice 
combinatorial properties help for sampling. Since enumeration and random gen- 
eration of general polyominoes appear intractable, it was proposed in statistical 
physics to study subclasses like convex or directed polyominoes, that display 
better enumerative properties. These objects can be described in terms of sim- 
ple languages, often algebraic, and this leads to efficient random generators. 

The family of planar maps (Section 9.3) is a further example of a class with
unexpectedly nice enumerative properties. Maps are the natural combinatorial 
abstraction for embeddings of graphs in the plane and for polygonal meshes 
in computational geometry, and maps were also largely studied in theoretical 
physics. Toy models of statistical physics, like percolation or the Ising model, 
are often studied on regular lattices, but also on random maps. The uniform 
distribution indeed appears to give, at the discrete level, the right notion of dis- 
tribution of probability on possible universes as prescribed by quantum gravity. 
In these various contexts, results have been obtained independently on counting, 
sampling and coding problems. Again we rely on a combinatorial explanation 
of the enumerative properties of planar maps to approach these three problems. 

Most of the time, we state and prove results for some particularly simple 
structures, while they are valid for more generic families (e.g. walks with more 
general steps, polyominoes on other lattices, maps with constraints). We made 
this choice to keep the chapter relatively short, but also because on these
simple structures the “traces” are more visible, and the underlying
combinatorics appears more explicitly.

All the objects that are considered in this chapter have nice geometric in- 
terpretations in the plane. We have chosen to rely on the geometric intuition of 
the reader to support these interpretations, and concentrate the proofs on the 
combinatorial aspects. 


9.1. Counting: walks in sectors of the plane 


A (nearest neighbor) walk on the square lattice $\mathbb{Z}^2$ is a finite sequence of vertices
$w = (w_0, w_1, \ldots, w_n)$ in $\mathbb{Z}^2$ such that each step $w_i - w_{i-1}$, for $1 \le i \le n$, belongs
to the set S = {(0,1), (0,-1), (-1,0), (1,0)}. The number of steps n is the length
of w; $w_0$ and $w_n$ are respectively its startpoint and endpoint. The reverse walk
of w is the walk $\overline{w} = (w_n, w_{n-1}, \ldots, w_1, w_0)$. A loop is a walk with identical
startpoint and endpoint.

Elements of S are also denoted u,d,l,r — standing for up, down, left and 
right. Unless explicitly specified, we consider walks up to translation, or equiv- 
alently, we assume that they start from the origin (0,0). A walk can thus be 
seen as a word on the alphabet S = {u,d,l,r} and we identify the set of walks 
with the language {u, d,l,r}*, making no distinction between both of them. 

In the rest of this section, we study families of walks with various boundary 
constraints: on a line, a half line, a half plane, a quarter plane, and finally, on 
the slitplane. This is the occasion to introduce enumerative tools that will be 
of use in later sections. 


9.1.1. Unconstrained walks and rational series 


Let us first consider walks that use only vertical steps (i.e. u or d), and hence
stay on the axis (x = 0). These walks are sometimes called one-dimensional
simple symmetric walks, and are often considered in their “time stretched”
version: each step u or d is replaced by a (1,1) or (1,-1) step, in order to give an
unambiguous representation in the plane, as illustrated by Figure 9.1. Up to
a $\pi/4$-rotation, these walks are in one-to-one correspondence with walks with
steps in {u,r} and as such, are sometimes called staircase walks, or directed
two-dimensional walks.

Counting these walks with respect to their length $\ell$ amounts to counting
words on {u,d} of length $\ell$, and there are $2^\ell$ of those. Restricting them to end
at ordinate j, with $\ell = 2n + |j|$ for some nonnegative n, is hardly more difficult:
for $j \ge 0$, the corresponding words are arbitrary shuffles of n + j letters u and
n letters d, and similarly for j < 0, they are shuffles of n letters u and n - j
letters d. Hence the number of walks of length 2n + |j| ending at ordinate j is

$$\binom{2n+|j|}{n} = \frac{(2n+|j|)!}{n!\,(n+|j|)!}.$$
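The count of walks ending at a given ordinate can be checked by brute force for small lengths. The Python sketch below enumerates all words on {u, d} of a given length and compares the number of those ending at ordinate j with the binomial coefficient $\binom{2n+|j|}{n}$; the function name `count_walks` is hypothetical.

```python
from itertools import product
from math import comb

def count_walks(length, j):
    """Number of words on {u, d} of the given length ending at ordinate j."""
    return sum(1 for w in product("ud", repeat=length)
               if w.count("u") - w.count("d") == j)

# Compare with the binomial coefficient for small n and j.
agrees = all(count_walks(2 * n + abs(j), j) == comb(2 * n + abs(j), n)
             for n in range(4) for j in range(-3, 4))
```

The enumeration is exponential in the length, so it serves only as a sanity check of the shuffle argument.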


Figure 9.1. Three representations of the one-dimensional walk duwuwudu:
(a) on an axis, (b) stretched, (c) and rotated.

It will be convenient to express enumerative results in terms of languages
and generating functions. In this case, the language $\mathcal{V}$ of walks on the vertical
axis is just $\{u,d\}^*$. Equivalently, in the algebra $\mathbb{Q}\langle\langle u,d \rangle\rangle$ of formal power series
in non commuting variables, the language $\mathcal{V}$ (viewed as the formal sum of its
words) is uniquely defined by the linear equation:

$$\mathcal{V} = \epsilon + (u + d)\,\mathcal{V}, \qquad (9.1.1)$$


which corresponds to the non ambiguous decomposition: “a walk is either the 
empty walk or made of a step u or d followed by a walk”. 

Define now $\delta(w) = |w|_u - |w|_d$ for any word $w$ on $S$, so that $\delta(w)$ is the final
ordinate of the walk $w$. The generating function of the language $\mathcal{V}$ with respect
to the length (variable $t$) and the final ordinate (variable $y$) is

$$V(t;y) = \sum_{w \in \mathcal{V}} t^{|w|} y^{\delta(w)},$$

which is an element of the algebra $\mathbb{Q}(y)[[t]]$ of formal power series in the variable
$t$ with coefficients that are rational functions in $y$.

Observe that $|\cdot|$ and $\delta$ are morphisms of monoids $(S^*,\cdot) \to (\mathbb{Z},+)$, so that
$V(t;y)$ can be viewed as the commutative image of $\mathcal{V}$ by the morphism of algebras
$w \mapsto t^{|w|} y^{\delta(w)}$ from $\mathbb{Q}\langle\langle u,d\rangle\rangle$ to $\mathbb{Q}(y)[[t]]$. Taking the commutative image of
Equation 9.1.1, the generating function $V(t;y)$ satisfies:

$$V(t;y) = 1 + \left(ty + \frac{t}{y}\right) V(t;y).$$


An explicit expression of $V(t;y)$ follows, and its expansion of course agrees with
the previous direct enumeration:

$$V(t;y) = \frac{1}{1 - t\left(y + \frac{1}{y}\right)} = \sum_{m \ge 0} \sum_{k=0}^{m} \binom{m}{k}\, t^m y^{m-2k}.$$


The commutative image mechanism produces a priori a formal power series
of $\mathbb{Q}(y)[[t]]$, but, as in the present example, it retains properties of the initial
language: the series $V(t;y)$ of the rational language $\{u,d\}^*$ is a rational function
of $t$ and $y$, i.e. belongs to $\mathbb{Q}(t,y)$. Walks with more general steps are dealt with
in a similar way: for instance the language $\mathcal{W}$ associated to walks in $\mathbb{Z}^2$ is $S^*$
(a) a Dyck word, (b) a Dyck prefix.


Figure 9.2. The family of Dyck words (stretched representations). 


and the generating function of these walks with respect to the length and the
coordinates of the endpoint is:

$$W(t;x,y) = \frac{1}{1 - t\left(x + \frac{1}{x} + y + \frac{1}{y}\right)}.$$


Another illustration is given by the family of walks that never immediately undo
a step they have just done. Their language is the set of words avoiding the factors
$\{ud, du, lr, rl\}$, which is well known to be rational. Accordingly their generating
function with respect to the length and the coordinates of the endpoint belongs
to $\mathbb{Q}(t,x,y)$. Conversely, when the generating function of a set of objects is
rational, it is natural to try to encode them by words of a rational language.


9.1.2. Walks on a half line and Catalan’s factorization 


We shall now consider walks that stay on the upper half axis ($x = 0$, $y \ge 0$).
More precisely let the depth of $w$ be the absolute value of the minimal ordinate
$\delta(v)$ for all prefixes $v$ of $w$. Walks that stay on the upper half axis are exactly
the walks with depth zero, and this condition is called the nonnegative prefix
condition. Loops satisfying the nonnegative prefix condition are often called
Dyck words on the alphabet $\{u,d\}$. In turn, walks satisfying the nonnegative
prefix condition are sometimes referred to as Dyck prefixes, since any of them
can be completed into a Dyck word. See Figure 9.2 for examples. Let $D$ denote
the language of Dyck words and $D_n$ the set of Dyck words of length $2n$. The
following lemma gives a central role to Dyck words.


LEMMA 9.1.1 (Catalan’s factorization). The language $\{u,d\}^*$ of one-dimensional
walks admits the following non ambiguous decomposition:

$$\{u,d\}^* = (Dd)^* D (uD)^*.$$

More precisely, the language of walks with depth $\ell$ and ending at ordinate $j$ is
$(Dd)^{\ell} D (uD)^{\ell+j}$.

Proof. For any word $w$ on the alphabet $\{u,d\}$ with depth $\ell$ and final ordinate
$j$, such a factorization is obtained at first passages from ordinate $i+1$ to $i$ for


Figure 9.3. Catalan’s factorization of a walk in (Dd)*D(uD)?. 


$i = -1, \dots, -\ell$ and last passages from ordinate $i$ to $i+1$ for $i = -\ell, \dots, j-1$.
The uniqueness of the decomposition follows from the fact that any strict prefix
$v$ of a word in $Dd$ satisfies $\delta(v) \ge 0$ by definition of $D$, and hence does not
belong to $Dd$. □
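The factorization of the lemma can be computed directly from the sequence of ordinates of the prefixes. The sketch below is one possible way to do it, under the assumption that a walk is given as a Python string over the letters `u` and `d`; the helper names are hypothetical. It cuts the word at the first passages to each new negative level and at the last passages leaving each level upwards.

```python
def heights(w):
    """Ordinates reached after each prefix of w (u = +1, d = -1)."""
    hs = [0]
    for c in w:
        hs.append(hs[-1] + (1 if c == "u" else -1))
    return hs

def is_dyck(w):
    """True if w has the nonnegative prefix property and ends at ordinate 0."""
    hs = heights(w)
    return min(hs) == 0 and hs[-1] == 0

def catalan_factorization(w):
    """Return (depth, final ordinate, list of Dyck factors) for w on {u, d}.

    The factors are separated by `depth` letters d followed by
    `depth + final` letters u, as in (Dd)^l D (uD)^(l+j).
    """
    hs = heights(w)
    depth, final = -min(hs), hs[-1]
    cuts = []
    # First passages from level i+1 to level i, for i = -1, ..., -depth.
    for level in range(-1, -depth - 1, -1):
        cuts.append(hs.index(level) - 1)
    # Last passages from level i to level i+1, for i = -depth, ..., final-1.
    for level in range(-depth, final):
        last_visit = len(hs) - 1 - hs[::-1].index(level)
        cuts.append(last_visit)
    factors, start = [], 0
    for c in cuts:
        factors.append(w[start:c])
        start = c + 1
    factors.append(w[start:])
    return depth, final, factors

if __name__ == "__main__":
    print(catalan_factorization("dudduuuud"))
```

Each returned factor can be checked with `is_dyck`, and re-inserting the separating letters rebuilds the original word.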


Catalan’s factorization immediately allows us to derive the total number of 
walks on the half line. 


PROPOSITION 9.1.2. The number of Dyck prefixes of length $m$ is

$$\binom{m}{\lfloor m/2 \rfloor}.$$


Proof. A Dyck prefix of even length is a walk with depth zero and even final
ordinate $2j$ for some integer $j \ge 0$. According to Lemma 9.1.1, the language of
these words is $D(uD)^{2j}$. Upon changing the $j$ first factors $u$ into factors $d$, words
of length $2n$ in this language are in bijection with words of length $2n$ in the
language $(Dd)^j D (uD)^j$, i.e. with words of the language of loops with depth $j$.
Hence Dyck prefixes of length $2n$ are in bijection with loops of the same length,
and their number is $\binom{2n}{n}$.

Similarly, a Dyck prefix of odd length ends at ordinate $2j+1$, for some $j \ge 0$.
But words of equal length in the languages $D(uD)^{2j+1}$ and $(Dd)^j D (uD)^{j+1}$ are
in bijection. The union of the last languages for all $j \ge 0$ is the set of words $w$
with $\delta(w) = 1$, $\binom{2n+1}{n}$ of which have length $2n+1$. □

The previous proof can be summarized as follows: find a factorization into 
Dyck factors separated by some specific steps (typically first or last passages), 
and then reorganize the factorization without modifying the Dyck factors. We 
shall apply this principle again to give a bijective enumeration of Dyck words. 


PROPOSITION 9.1.3. The number of loops of length $2n$ that stay on the half
axis ($x = 0$, $y \ge 0$) is the $n$-th Catalan number:

$$C_n = \frac{1}{n+1} \binom{2n}{n}.$$

Proof (as a corollary of Proposition 9.1.2). Removing the last step of a Dyck
prefix of length $2n+1$ yields a prefix of length $2n$. In this way every Dyck
prefix of length $2n$ is obtained twice, except for Dyck paths, which are obtained
only once. Hence $\binom{2n+1}{n} = 2\binom{2n}{n} - \operatorname{Card} D_n$, and the formula follows. □


Proof (direct bijection). We prove the relation $(n+1) \operatorname{Card} D_n = \binom{2n}{n}$ by giving
a bijection between the set of pairs $(v,v')$ with $vv' \in D_n$ and $v$ empty or ending
with a letter $u$, and the set of loops of length $2n$. To do that we first state two
factorizations that follow from Lemma 9.1.1:

- the set of pairs $(v,v')$ as above with $\delta(v) = \ell$ is $(Du)^{\ell} \times D(dD)^{\ell}$;

- the set of loops with depth $\ell$ is $(Dd)^{\ell} D (uD)^{\ell}$.

Exchanging $u$ and $d$ factors in these decompositions leads to the announced
bijection. □


The same idea allows us to refine the enumeration of Dyck prefixes.

PROPOSITION 9.1.4. The number of Dyck prefixes of length $2n + j$ and final
ordinate $j \ge 0$ is

$$\frac{j+1}{n+j+1} \binom{2n+j}{n}.$$


Proof. We prove the formula by giving a bijection between pairs $(w,i)$ where $w$
is a walk with $\delta(w) = j$ and $i \in \{0,\dots,j\}$, and pairs $(w',k)$ where $w'$ is a Dyck
prefix with $\delta(w') = j$ and $k \in \{0,\dots,n+j\}$:

- to any pair $(w,i)$ as above, associate $(w_i,\dots,w_j,w_0,\dots,w_{i-1})$ where $w_0$ is
the loop and the other $w_\ell$ are the Dyck paths such that $w = w_0 u w_1 \cdots u w_j$
(this is the decomposition at the last passages at level $0, \dots, j$).

- to any pair $(w',k)$ as above, associate $(w'_0,\dots,\tilde{w}_i,\dots,w'_j)$, where the $w'_\ell$
are the Dyck words such that $w' = w'_0 u w'_1 \cdots u w'_j$, $i$ is the index of the $w'_i$
containing or following the $k$th letter $u$ in the word $uw'$, and $\tilde{w}_i = (v,v')$
is the factorization of $w'_i$ after this letter.

The bijection in the second proof of Proposition 9.1.3 allows us to transform the
pair $\tilde{w}_i = (v,v')$ into a loop, so that both sets are associated to the same set of
sequences of $j + 1$ walks. □
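The two counting results for Dyck prefixes can be cross-checked numerically. The sketch below (an illustration only, with hypothetical function names) builds the table of Dyck prefixes by length and final ordinate with a simple recurrence on the last step, then compares it with the refined formula $\frac{j+1}{n+j+1}\binom{2n+j}{n}$ and with the total $\binom{m}{\lfloor m/2\rfloor}$.

```python
from math import comb

def dyck_prefix_table(max_len):
    """table[m][j] = number of Dyck prefixes of length m ending at ordinate j."""
    table = [[1]]  # the empty word
    for m in range(1, max_len + 1):
        prev = table[-1]
        row = [0] * (m + 1)
        for j in range(m + 1):
            if j >= 1:
                row[j] += prev[j - 1]      # last step is u
            if j + 1 < len(prev):
                row[j] += prev[j + 1]      # last step is d
        table.append(row)
    return table

def ballot(n, j):
    """Refined count: Dyck prefixes of length 2n + j with final ordinate j."""
    return (j + 1) * comb(2 * n + j, n) // (n + j + 1)

if __name__ == "__main__":
    table = dyck_prefix_table(6)
    for m, row in enumerate(table):
        print(m, row, sum(row), comb(m, m // 2))
```

Summing a row of the table over all final ordinates gives the total number of Dyck prefixes of that length.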


9.1.3. Walks on a half plane and algebraic series 


Walks in the half plane ($y \ge 0$) are hardly more complicated to enumerate than
walks on the half line. Indeed, as words on the alphabet $S$, these walks are
completely characterized by the fact that all their prefixes $v$ contain at least as
many letters $u$ as letters $d$. Hence the associated language is the set of shuffles
of vertical Dyck prefixes with sequences of horizontal steps. Various formulas
can be derived from this characterization: for instance, the number of loops of
length $2n$ that stay in the half plane ($y \ge 0$) is


$$\sum_{k=0}^{n} \binom{2n}{2k} \binom{2n-2k}{n-k}\, C_k.$$
Figure 9.4. An excursion in the half plane.


Rather than going further in this direction, we shall observe that the set of
these walks is an algebraic language and return to generating functions. Consider
the alphabet $A_k = \{u,d,x_1,\dots,x_k\}$, and the monoid morphism $\delta$ defined
as previously by $\delta(w) = |w|_u - |w|_d$. The language $\mathcal{M}^{(k)}$ of $k$-colored Motzkin
words is the set of words $w$ on the alphabet $A_k$ satisfying $\delta(w) = 0$ and the
nonnegative prefix property. For $k = 0$ this is the Dyck language. For $k = 2$,
upon setting $x_1 = l$, $x_2 = r$, bicolored Motzkin words are excursions in the half
plane, i.e. walks in the half plane ($y \ge 0$) that finish on the axis ($y = 0$).

The language of $k$-colored Motzkin words admits an algebraic description:

$$\mathcal{M}^{(k)} = \varepsilon + (x_1 + \dots + x_k)\,\mathcal{M}^{(k)} + u\,\mathcal{M}^{(k)}\,d\,\mathcal{M}^{(k)}, \qquad (9.1.2)$$

which derives immediately from the non ambiguous decomposition of any non
empty Motzkin word at its smallest non empty prefix $v$ such that $\delta(v) = 0$.
Taking the commutative image, the generating function $M^{(k)}(t) = \sum_{w \in \mathcal{M}^{(k)}} t^{|w|}$
of the Motzkin language with respect to the length satisfies the equation:

$$M^{(k)}(t) = 1 + kt\,M^{(k)}(t) + t^2 M^{(k)}(t)^2. \qquad (9.1.3)$$

Observe that this equation completely determines $M^{(k)}(t)$, since it has a unique
solution in the space of formal power series in the variable $t$ (as can be checked
by induction, extracting the coefficient of $t^n$ on both sides).

Any additive parameter can be taken into consideration in the commutative 
image. For instance the previous algebraic decomposition yields the following 
proposition in the case of bicolored Motzkin words. 


PROPOSITION 9.1.5. The generating function for walks in the half plane returning
to the axis ($y = 0$), with respect to their length, abscissa of the endpoint
and number of vertical steps, is:

$$M^{(2)}(t;x,z) = \frac{1 - t\left(x + \frac{1}{x}\right) - \sqrt{\left[1 - t\left(x + \frac{1}{x} + 2z\right)\right]\left[1 - t\left(x + \frac{1}{x} - 2z\right)\right]}}{2t^2z^2}.$$

Proof. Taking the commutative image with the map $w \mapsto t^{|w|} x^{|w|_r - |w|_l} z^{|w|_u + |w|_d}$
yields the equation

$$M^{(2)}(t;x,z) = 1 + t\left(x + \frac{1}{x}\right) M^{(2)}(t;x,z) + t^2z^2\, M^{(2)}(t;x,z)^2.$$




The discriminant of this equation is

$$\Delta(t;x,z) = \left[t\left(x + \frac{1}{x}\right) - 1\right]^2 - 4t^2z^2,$$

and among the two roots of the quadratic equation, only the one of the
proposition is a formal power series in $t$. □


Equation (9.1.3) shows that the series $M^{(k)}(t)$ satisfies a relation of the form
$P(M^{(k)}(t), t) = 0$ with $P$ a polynomial, which means that it is an algebraic
formal power series. This illustrates the fact that algebraic languages that 
admit a non ambiguous algebraic description naturally have algebraic generating 
functions with respect to additive parameters. Conversely, when the generating 
function of a set of objects is algebraic, one would like to obtain it from an 
algebraic description of the objects (or more formally from an encoding of the 
objects by the words of an algebraic language with a non ambiguous description). 
In this sense, Equation (9.1.2) is more satisfying than Catalan’s factorization, 
even though the commutative image of the latter also induces an algebraic 
equation. 

Expanding the generating function $M^{(2)}(t;1,1) = (1 - 2t - \sqrt{1 - 4t})/(2t^2)$ in
powers of $t$, one observes the following amusing result (cf. Problem 9.1.5).


COROLLARY 9.1.6. The number of bicolored Motzkin words of length $n$ is
given by the Catalan number $C_{n+1}$.
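The remark that the quadratic equation $M^{(k)}(t) = 1 + kt\,M^{(k)}(t) + t^2 M^{(k)}(t)^2$ determines its solution coefficient by coefficient translates into a short recurrence. The following sketch (function names are ours, not the text's) extracts the coefficient of $t^n$ on both sides and compares the case $k = 2$ with the Catalan numbers.

```python
from math import comb

def motzkin_colored(k, size):
    """Coefficients m[0..size] of M^(k)(t), from M = 1 + k t M + t^2 M^2."""
    m = [0] * (size + 1)
    m[0] = 1
    for n in range(1, size + 1):
        # Coefficient of t^n in k t M: a colored level step followed by a word.
        m[n] = k * m[n - 1]
        # Coefficient of t^n in t^2 M^2: u, a word, d, a word.
        if n >= 2:
            m[n] += sum(m[i] * m[n - 2 - i] for i in range(n - 1))
    return m

def catalan(n):
    return comb(2 * n, n) // (n + 1)

if __name__ == "__main__":
    print(motzkin_colored(2, 8))
    print([catalan(n + 1) for n in range(9)])
```

With `k = 0` the same recurrence yields the Dyck words (nonzero coefficients at even lengths only), and with `k = 1` the usual Motzkin numbers.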


Loops in the up diagonal quadrant ($x + y \ge 0$, $y \ge x$) are simple to describe:
let $w$ be such a loop of length $2n$, and consider the projections of the walk on
the two diagonals ($x = y$) and ($x = -y$). Let $\{a,b\}$ be the elementary steps on
these two axes, with $a$ corresponding to up steps and $b$ to down steps. Steps in
$\mathbb{Z}^2$ have the following projections:

$$u \mapsto (a,a), \qquad d \mapsto (b,b), \qquad l \mapsto (b,a), \qquad r \mapsto (a,b),$$

and the projections of $w$ on the diagonals are Dyck words of length $2n$ on
$\{a,b\}$; reciprocally any pair of Dyck words of same length over this alphabet
corresponds to a loop in the up diagonal quadrant. Hence:


PROPOSITION 9.1.7. The number of loops of length $2n$ that stay in the diagonal
quadrant ($x + y \ge 0$, $y \ge x$) is given by:

$$C_n^2.$$


More generally, any walk of length $2n + |i| + j$ and endpoint $(i,j)$ in the up
diagonal quadrant is described by its projections on the two diagonal axes; these
projections are decoupled Dyck prefixes of length $2n + |i| + j$ with respective
ordinate of the endpoint $j + i$ and $j - i$. Hence:


(a) A walk in the diagonal up quadrant. (b) A loop in the first quadrant. 


Figure 9.5. Walks in quadrants. 


PROPOSITION 9.1.8. The number of walks of length $2n + |i| + j$ and endpoint
$(i,j)$ that stay in the diagonal quadrant ($x + y \ge 0$, $y \ge x$) is given by:

$$\frac{(j+i+1)(j-i+1)}{(n+j+|i|+1)(n+j+1)} \binom{2n+|i|+j}{n+|i|} \binom{2n+|i|+j}{n},$$

and the total number of walks of length $n$ that stay in the diagonal quadrant
($x + y \ge 0$, $y \ge x$) is given by

$$\binom{n}{\lfloor n/2 \rfloor}^2.$$


The case of loops in the first quadrant ($x \ge 0$, $y \ge 0$) is quite similar. These
loops are words $w$ on $S$ such that both restrictions of $w$ to $\{u,d\}$ and to $\{l,r\}$
are Dyck words; hence the language of loops in the first quadrant is the shuffle
of the Dyck languages on $\{u,d\}$ and $\{l,r\}$.


PROPOSITION 9.1.9. The number of loops of length $2n$ that stay in the quadrant
($x \ge 0$, $y \ge 0$) is given by:

$$\sum_{k=0}^{n} \binom{2n}{2k}\, C_k\, C_{n-k} = \frac{1}{(2n+1)(2n+4)} \binom{2n+2}{n+1}^2.$$
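The shuffle description of quadrant loops is easy to test on small cases. The sketch below is only an illustration: it enumerates all words of length $2n$ on $\{u,d,l,r\}$, keeps the loops that stay in the quadrant ($x \ge 0$, $y \ge 0$), and compares the count with the shuffle sum $\sum_k \binom{2n}{2k} C_k C_{n-k}$ and with the closed form $\binom{2n+2}{n+1}^2 / ((2n+1)(2n+4))$, which also equals $C_n C_{n+1}$. All function names are hypothetical.

```python
from itertools import product
from math import comb

STEPS = {"u": (0, 1), "d": (0, -1), "l": (-1, 0), "r": (1, 0)}

def catalan(n):
    return comb(2 * n, n) // (n + 1)

def quadrant_loops(n):
    """Brute-force count of loops of length 2n staying in x >= 0, y >= 0."""
    count = 0
    for w in product("udlr", repeat=2 * n):
        x = y = 0
        for c in w:
            dx, dy = STEPS[c]
            x, y = x + dx, y + dy
            if x < 0 or y < 0:
                break
        else:
            if x == 0 and y == 0:
                count += 1
    return count

def shuffle_sum(n):
    """Choose the 2k positions of vertical steps, then two Dyck words."""
    return sum(comb(2 * n, 2 * k) * catalan(k) * catalan(n - k) for k in range(n + 1))

def closed_form(n):
    return comb(2 * n + 2, n + 1) ** 2 // ((2 * n + 1) * (2 * n + 4))

if __name__ == "__main__":
    for n in range(4):
        print(n, quadrant_loops(n), shuffle_sum(n), closed_form(n))
```

The brute-force enumeration visits $4^{2n}$ words, so it is practical only for very small $n$.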


The general case of walks with given length and endpoint or with given 
length is similar to the case of the diagonal quadrant and left to the reader. 


A remarkable consequence of these formulas is that the language of walks
in the diagonal quadrant (or in the standard quadrant) cannot be an algebraic language:
on the one hand the number of walks of length $n$ in the diagonal
quadrant, $\binom{n}{\lfloor n/2 \rfloor}^2$, grows like $4^n/n$ when $n$ goes to infinity; on the other hand,
the possible asymptotic behaviors of the Taylor coefficients of an algebraic series
are classified, and do not include the form $\rho^n n^{-i}$ for $i$ a positive integer; therefore
the generating function of walks in the diagonal quadrant is not algebraic, and
the associated language cannot be algebraic either.
(a) A walk on the slitplane. (b) The factorization of a walk.


Figure 9.6. On the slitplane. 


9.1.4. Walks on the slitplane and the cycle lemma 


We call slitplane the complement of the half axis ($x = 0$, $y \le 0$) in the square
lattice $\mathbb{Z}^2$. Walks on the slitplane are defined as walks that do not touch this half
axis except maybe at their startpoint or endpoint, as shown in Figure 9.6(a).

The tool we shall use to enumerate walks on the slitplane is the so-called
cycle lemma. For any alphabet $A$ endowed with a morphism $\delta : (A^*,\cdot) \to (\mathbb{Z},+)$,
a word $w$ in $A^*$ is said to have the Łukasiewicz property if every strict prefix $v$
of $w$ satisfies $\delta(v) > \delta(w)$.


LEMMA 9.1.10 (Cycle lemma). Let $A$ be an alphabet endowed with a morphism
$\delta : (A^*,\cdot) \to (\mathbb{Z},+)$. Then a word $w$ in $A^*$ such that $\delta(w) = -1$ admits
a unique factorization $w_1w_2$ with $w_1$ non empty such that $w_2w_1$ has the
Łukasiewicz property.


Proof. Let $w_1$ be the shortest prefix of $w$ with $\delta(w_1)$ equal to the minimal
ordinate reached by $w$ (the opposite of its depth).
Then $w_2w_1$ has the Łukasiewicz property. Moreover, let us verify that there is
no other such factorization. First assume that $w'_1$ is a prefix of $w$ shorter than
$w_1$. Then the prefix $w''$ of $w'_2$ of length $|w_1| - |w'_1|$ satisfies $\delta(w'') < 0$ and is
also a strict prefix of $w'_2w'_1$. Hence $w'_2w'_1$ does not have the Łukasiewicz property. It
remains to consider the case of a prefix $w'_1$ of $w$ longer than $w_1$. The suffix $w''$
of $w'_1$ of length $|w'_1| - |w_1|$ satisfies $\delta(w'') \ge 0$ and is also a suffix of $w'_2w'_1$. Since
moreover $\delta(w'_2w'_1) = -1$, $w'_2w'_1$ does not have the Łukasiewicz property. □
COROLLARY 9.1.11. Consider the alphabet $A = \{a_1,a_2,\dots,a_k\}$, endowed
with a morphism $\delta$, and let $n_1,n_2,\dots,n_k$ be nonnegative integers such that

$$\sum_{i=1}^{k} n_i\,\delta(a_i) = -1.$$

Then the number of words with $n_i$ letters $a_i$ for any $1 \le i \le k$ that have the
Łukasiewicz property is equal to:

$$\frac{1}{n_1 + \dots + n_k} \binom{n_1 + \dots + n_k}{n_1, n_2, \dots, n_k}.$$

Proof. For any word $w$ as above, $\delta(w) = -1$, so that the conjugacy class of $w$
contains $|w|$ different words. According to the cycle lemma exactly one of these
$n_1 + \dots + n_k$ words has the Łukasiewicz property. The formula follows. □


For $A = \{u,d\}$ with $\delta(u) = 1$, $\delta(d) = -1$, the set of words enumerated
by the previous corollary is the Dyck-Łukasiewicz language $Dd$, and we recover
Proposition 9.1.3.


COROLLARY 9.1.12. Let $C$ be a code for a set of words on the alphabet $A$.
Then the generating function (with respect to the length) for Łukasiewicz words
$w$ in $C^*$ such that $\delta(w) = -1$ is equal to

$$[y^{-1}] \log \frac{1}{1 - C(t;y)},$$

where $C(t;y)$ is the generating function of the code $C$ with respect to the length
(variable $t$) and to $\delta$ (variable $y$).


Proof. The generating function of words on the alphabet $A$ with $k$ factors in $C$ is
$C(t;y)^k$. Restricting the generating function to words $w$ with $\delta(w) = -1$ is done
by taking the coefficient of $y^{-1}$ in the series. The fraction of these words that
have the Łukasiewicz property is then $1/k$, so that their generating function is

$$\sum_{k \ge 1} \frac{1}{k}\,[y^{-1}]\,C(t;y)^k = [y^{-1}] \log \frac{1}{1 - C(t;y)}.$$
□


To study walks on the slitplane, it is natural to decompose them at points
where they touch the vertical axis ($x = 0$), as shown in Figure 9.6: any walk
$w$ on the plane that finishes on the vertical axis can be uniquely factored into
vertical steps on this axis and primitive excursions in the left or right half plane;
in other terms, the language of these walks is

$$\left(u + d + l\,\mathcal{M}^{(l)}\,r + r\,\mathcal{M}^{(r)}\,l\right)^*,$$

where $\mathcal{M}^{(l)}$ and $\mathcal{M}^{(r)}$ respectively denote the set of excursions in the left half
plane ($x \le 0$) and in the right one ($x \ge 0$). Hence the set $\{u,d\} \cup l\,\mathcal{M}^{(l)}\,r \cup r\,\mathcal{M}^{(r)}\,l$
forms a code $C$ for walks on the plane ending on the vertical axis: these walks
can thus be viewed as walks on the axis ($x = 0$) with the infinite set of steps $C$.

To apply the cycle lemma to walks on the slitplane, we consider again the
morphism $\delta(w) = |w|_u - |w|_d$. Let us single out the class of walks on the slitplane
that start at position $(0,1)$ and end on the half axis at position $(0,0)$: these
walks are exactly the Łukasiewicz words $w$ in $C^*$ such that $\delta(w) = -1$.
(a) A walk $w$ on the slitplane. (b) The half plane excursion $\phi(w)$.


Figure 9.7. On the slitplane. 


PROPOSITION 9.1.13. The number of walks on the slitplane with startpoint
$(0,0)$, endpoint $(0,1)$ and length $2n + 1$ is:

$$C_{2n+1} = \frac{1}{2n+2} \binom{4n+2}{2n+1}.$$


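Proof. Let $C(t;y)$ be the commutative image of $C$, so that $1/(1 - C(t;y))$ is
the generating function of words on the code $C$. Observe that a $\pi/2$- (respectively
$-\pi/2$-) rotation maps bijectively words of length $n$ in $\mathcal{M}^{(l)}$ (resp. $\mathcal{M}^{(r)}$)
on words of length $n$ in the bicolored Motzkin language $\mathcal{M}^{(2)}$, hence
Proposition 9.1.5 yields:

$$\log \frac{1}{1 - C(t;y)} = \frac{1}{2} \log \frac{1}{1 - t\left(y + \frac{1}{y} + 2\right)} + \frac{1}{2} \log \frac{1}{1 - t\left(y + \frac{1}{y} - 2\right)}$$

$$= \frac{1}{2} \sum_{n \ge 1} \frac{t^n}{n} \left( \left(y + \frac{1}{y} + 2\right)^n + \left(y + \frac{1}{y} - 2\right)^n \right).$$

The formula follows by extracting the coefficient of $y^{-1}$ and resumming. □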


The above proof does not yield an interpretation of the occurrence of Catalan 
numbers in Proposition 9.1.13. We conclude this section with a more direct 
derivation. 


Proof of Proposition 9.1.13 (bis). We are interested in walks $w$ such that

- $|w|_l = |w|_r$, and $|w|_d = |w|_u + 1$,

- and for any strict prefix $v$ of $w$, either $|v|_l \ne |v|_r$, or $|v|_u \ge |v|_d$.

The first condition accounts for the displacement between the startpoint and
endpoint, while the second one ensures that the walks stay in the slitplane. Let
us describe a one-to-one correspondence $\phi$ between these walks and excursions
of even length in the half plane (bicolored Motzkin words). The result then
follows from Corollary 9.1.6.

Let $w$ be a walk as in the proposition. Since $|w|_d = |w|_u + 1$, Lemma 9.1.10
yields a unique factorization of $w$ in $w_1 d w_2$ such that each proper prefix $v$ of
(a) A polygon, (b) an animal, (c) and a polyomino.


Figure 9.8. Three related classes of objects. 


$w_2 w_1 d$ satisfies $|v|_u \ge |v|_d$: this is the factorization at the first arrival to the
lowest level. Let $\bar{w}_2$ be the walk that is symmetric to $w_2$ with respect to the
vertical axis ($x = 0$), and $\phi(w)$ be equal to $\bar{w}_2 w_1$. Then $\phi(w)$ is a bicolored
Motzkin word, corresponding to an excursion in the half plane ($y \ge 0$) of length
$2n$. Moreover the factorization $\bar{w}_2 w_1$ of $\phi(w)$ is the factorization at the first
passage on the lowest point on the vertical line of equidistance between the
startpoint and endpoint of $\phi(w)$.

Conversely, given a bicolored Motzkin word $w'$, let $w'_2 w'_1$ be its factorization
at the first passage on the lowest point on the vertical line of equidistance
between its startpoint and endpoint. Let $\psi(w') = w'_1 d\, \bar{w}'_2$. The walk $\psi(w')$ is
clearly a walk in the slitplane from $(0,1)$ to $(0,0)$, and $\phi(\psi(w')) = w'$. Moreover,
$\psi(\phi(w)) = w$ for any walk $w$ as in the proposition, and this concludes the proof. □



As discussed in Section 9.1.3, the language of bicolored Motzkin words has 
a very natural algebraic decomposition. However this decomposition does not 
carry very well through the bijection. 


9.2. Sampling: polygons, animals and polyominoes 


A walk on the square lattice $\mathbb{Z}^2$ is called a self-avoiding walk, or a path, if it
visits each vertex of the lattice at most once. A self-avoiding polygon, or simply
in this text, a polygon, is a self-avoiding loop.

An animal is a set A of vertices of the lattice such that any two vertices of 
A are connected by a path visiting only vertices of A. Animals are considered 
up to translations of the lattice. Placing a unit square centered on each vertex 
of A, we obtain a polyomino. The latter are however more naturally defined as 
edge-connected sets of squares of the lattice. These definitions are illustrated 
by Figure 9.8. Each polygon is the contour (or the boundary) of a simply- 
connected polyomino, and in the plane this is a one-to-one correspondence (see 
Figures 9.10, 9.11 and 9.12). In particular the length of a polygon corresponds 
to the perimeter of the polyomino. A polygon has moreover dimension (p,q) 
if the smallest rectangle in which it can be inscribed has horizontal width p 
(a) A convex polyomino. (b) A directed polyomino.


Figure 9.9. Subclasses of polyominoes. 


and vertical width $q$. Finally the area of a polyomino is its number of cells,
corresponding for animals to the number of vertices. 

Little can be said from the enumerative point of view on animals, polygons 
or polyominoes in general. Two ideas have however been particularly successful 
for defining subclasses amenable to mathematical study and still of interest: 
restriction to convex or to directed objects. A polygon of dimension (p,q) is 
convex if its length is 2p+2q. This definition stresses the fact that convex poly- 
gons are in some sense the most extended polygons, and do not make meanders. 
An equivalent, but maybe more appealing interpretation is in terms of polyomi- 
noes: a polyomino is convex if its intersection with any horizontal or vertical 
line is connected. A polyomino (respectively an animal) is directed if there is 
a cell (resp. a vertex) from which every cell (resp. vertex) can be reached by 
a path going up or right inside the object. These definitions are illustrated by 
Figure 9.9. 


9.2.1. Generalities on sampling 


Together with the enumerative questions, much interest has been given to the 
properties of random animals, polyominoes and polygons. By random is meant 
here the uniform distribution: objects of equal size are given equal probability 
to appear. We illustrate this trend by concentrating on the derivation of random 
generators. In order to describe these algorithms, we assume that we have at 
our disposal a perfect random number generator RAND(m,n) that outputs an 
integer of the interval $[m,n]$ chosen with uniform probability: for all $m \le i \le n$,


P(RAND(m,n) =i) = 1/(n—m+1). 


We assume unit cost for arithmetic operations and for calls to the generator 
RAND(). These randomness and complexity models are justified by the fact 
that our algorithms only sample and compute on integers that are polynomially 
bounded in the size of the objects generated. 

We shall need a random sampler for elements of G(w), the set of permu- 
tations of the letters of a fixed word w. The following algorithm does this by 
applying a random permutation to the letters of w. 
Figure 9.10. A parallelogram polyomino and its contour.


RANDPERM(w)
1  for i ← 2 to |w| do
2      SWAP(w[i], w[RAND(1, i)])
3  return w


LEMMA 9.2.1. RANDPERM(w) returns in linear time a random element of
G(w) under the uniform distribution: for all $w' \in G(w)$,

$$P(\mathrm{RANDPERM}(w) = w') = \frac{1}{\operatorname{Card}(G(w))}.$$

Proof. A permutation $\sigma$ on the set $\{1,\dots,n\}$ has a unique decomposition as a
product $\sigma = \tau_n \cdots \tau_2$ of transpositions of the form $\tau_i = (j_i, i)$ with $1 \le j_i \le i$,
and conversely any such decomposition provides a permutation. Therefore,
the call RANDPERM(w) on a word $w$ with distinct letters generates a uniform
random permutation of the letters. Upon labelling identical letters by their
initial place, we conclude that uniformity is also preserved in the general case. □

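For readers who want to experiment, here is a minimal Python transcription of RANDPERM. It assumes that the perfect generator RAND(m, n) can be approximated by `randint` from the standard `random` module (or from any object with the same method, passed as `rng`), and it uses 0-based indices, so the loop index `i` below corresponds to $i + 1$ in the pseudocode.

```python
import random

def randperm(w, rng=random):
    """Return a random rearrangement of the letters of the word w.

    Position i is swapped with a position chosen uniformly in [0, i],
    exactly as in the pseudocode, up to the shift to 0-based indexing.
    """
    letters = list(w)
    for i in range(1, len(letters)):
        j = rng.randint(0, i)            # plays the role of RAND(1, i)
        letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)

if __name__ == "__main__":
    print(randperm("hhhvv", random.Random(2004)))
```

Passing a seeded `random.Random` instance makes runs reproducible; the default uses the module-level generator.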


In the rest of this part, we describe random sampling algorithms for convex 
polygons and directed animals. 


9.2.2. Parallelogram polyominoes and the cycle lemma 


A convex polyomino $P$ is a parallelogram polyomino if its contour contains the
bottom left and top right corners of its bounding box. Equivalently, its contour
must be a staircase polygon, i.e. a polygon made of two up-right directed paths,
meeting only at their extremities. These upper and lower paths, being directed,
can be coded with two letters. For later purpose, it will be convenient to code
them on the alphabet $\{h,v\}$, with $h$ standing for a horizontal step and $v$ standing
for a vertical step. Starting from the bottom left corner, let $v w_1 h$ be the word
coding the upper path, and $h w_2 v$ be the word coding the lower path (there
is no choice for first and last letters). If $P$ has dimension $(p+1,q+1)$ then
$|w_1|_h = |w_2|_h = p$ and $|w_1|_v = |w_2|_v = q$. The reduced code $w$ of a staircase
polygon is the word on the alphabet $A = \{\binom{h}{h}, \binom{h}{v}, \binom{v}{h}, \binom{v}{v}\}$ obtained by
stacking the two words $w_1$ and $w_2$. In the example of Figure 9.10, the two paths
are respectively $v w_1 h = v \cdot vhvhvvhhhvhh \cdot h$ and $h w_2 v = h \cdot hhvhvhvvhhvh \cdot v$.
Words on $A$ that code for staircase polygons are characterized by the facts
that they have an equal number of letters $h$ in both rows, and that their prefixes
contain at least as many letters $\binom{v}{h}$ as letters $\binom{h}{v}$: indeed, the morphism $\delta$
induced by $\{\delta\binom{v}{h} = 1,\ \delta\binom{h}{v} = -1,\ \delta\binom{h}{h} = \delta\binom{v}{v} = 0\}$ measures the distance
between the upper and lower paths along diagonals, and the positive prefix
property expresses the condition that the upper and lower paths do not meet
before their endpoint. Codes of staircase polygons are thus essentially bicolored
Motzkin words.

This characterization suggests constructing staircase polygons by applying
the cycle lemma to words of the set $S(p,q)$ of words of length $p+q+1$ on $A$
with $p+1$ letters $h$ and $q$ letters $v$ in the first row, and $p$ letters $h$ and $q+1$
letters $v$ in the second row:


STAIRCASE(p, q)
1  w'_1 ← RANDPERM(h^{p+1} v^q)             ▷ generate w' = (w'_1 over w'_2) ∈ S(p, q)
2  w'_2 ← RANDPERM(h^p v^{q+1})
3  (m, δ_m) ← (0, 0)                        ▷ seek the position m of the
4  δ ← 0                                    ▷ leftmost minimum w.r.t. δ
5  for i ← 1 to p + q + 1 do
6      if (w'_1[i], w'_2[i]) = (v, h) then
7          δ ← δ + 1
8      elseif (w'_1[i], w'_2[i]) = (h, v) then
9          δ ← δ − 1
10         if δ < δ_m then
11             (m, δ_m) ← (i, δ)
12 (w_1 h, w_2 v) ← SHIFT((w'_1, w'_2), m)  ▷ get the conjugate at position m
13 return (v w_1 h, h w_2 v)


PROPOSITION 9.2.2. STAIRCASE(p, q) produces the code of a random uniform
staircase polygon with dimension $(p+1,q+1)$ in linear time.


Proof. Let us first use the cycle lemma to derive the number of staircase
polygons. The number of words in $S(p,q)$ is $\operatorname{Card}(S(p,q)) = \binom{p+q+1}{q}\binom{p+q+1}{p}$. Then
among the $p+q+1$ cyclic shifts of any word $w' \in S(p,q)$, exactly one is of
the form $w\binom{h}{v}$ with $w$ having the positive prefix property. Hence the number of
staircase polygons with dimension $(p+1,q+1)$ is

$$\frac{1}{p+q+1} \binom{p+q+1}{q} \binom{p+q+1}{p}.$$

The algorithm STAIRCASE() generates a word uniformly at random in the
set $S(p,q)$, and computes its unique cyclic shift coding for a staircase polygon.
The probability to get the code of a given polygon $P$ is thus the sum of the
probabilities to get each of its cyclic shifts. But the code of $P$ admits $p+q+1$
distinct cyclic shifts, and each of these words has probability $1/\operatorname{Card}(S(p,q))$ to
be obtained. Thus the probability to get $P$ is $(p+q+1)/\operatorname{Card}(S(p,q))$, i.e.
depends only on the dimension of $P$: uniformity is preserved through the cycle
lemma. □
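The STAIRCASE procedure can be sketched in Python as follows. This is an illustrative transcription rather than a reference implementation: `rng.shuffle` from the standard `random` module stands in for RANDPERM, words are Python strings, and the helper `path_points` (our addition) converts a word on $\{h,v\}$ into lattice points so that the output can be inspected.

```python
import random

def staircase(p, q, rng=random):
    """Return (upper, lower) codes of a random staircase polygon of
    dimension (p + 1, q + 1), as words on {h, v}."""
    top = list("h" * (p + 1) + "v" * q)
    bottom = list("h" * p + "v" * (q + 1))
    rng.shuffle(top)                 # stand-in for RANDPERM
    rng.shuffle(bottom)
    m, delta_min, delta = 0, 0, 0
    for i, pair in enumerate(zip(top, bottom), start=1):
        if pair == ("v", "h"):
            delta += 1
        elif pair == ("h", "v"):
            delta -= 1
            if delta < delta_min:    # leftmost minimum so far
                m, delta_min = i, delta
    # Conjugate at position m: the shifted rows end with h (top) and v (bottom).
    top = top[m:] + top[:m]
    bottom = bottom[m:] + bottom[:m]
    return "v" + "".join(top), "h" + "".join(bottom)

def path_points(word):
    """Lattice points visited by an up-right path coded on {h, v}."""
    x = y = 0
    points = [(0, 0)]
    for c in word:
        if c == "h":
            x += 1
        else:
            y += 1
        points.append((x, y))
    return points

if __name__ == "__main__":
    upper, lower = staircase(3, 2, random.Random(9))
    print(upper, lower)
```

For a valid output, the points of the two paths should intersect only at $(0,0)$ and $(p+1,q+1)$, which `path_points` makes easy to verify.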
Figure 9.11. A directed convex polyomino and its contour.


9.2.3. Directed convex polyominoes and Catalan’s factorization 


Directed convex polyominoes are characterized among convex polyominoes by
the property that their contour contains the bottom left corner of their bounding
box. In other terms contours of directed convex polyominoes are unimodal
polygons, i.e. shuffles of a word of the language $u^*d^*$ and a word of the language
$r^*l^*$. Let us consider a unimodal polygon with dimension $(p+1,q+1)$, and
decompose it into an upper path and a lower path both starting from the bottom
left corner and of length $p+q+2$, and respectively obtained in clockwise and
counterclockwise direction. Let $w'_1$ and $w'_2$ be the codes of these two paths on the
alphabet $\{h,v\}$. In the example of Figure 9.11, the two paths are respectively
$w'_1 = vhvhvvhvhvhhvv$ and $w'_2 = hhhhvhvhhhvhvh$. The following properties
of $w'_1$ are immediate consequences of the definition of unimodal polygons:

1. the word $w'_1$ starts with a letter $v$;

2. it contains at least $q+1$ letters $v$;

3. the first $q+1$ letters $v$ code up steps, the other ones down steps;

4. the $(q+1)$th letter $v$ is followed by a letter $h$.

The last property accounts for the right turn that the path has to make when
reaching the upper boundary. Define the reduced code $w_1$ as obtained from
$w'_1$ by deleting the two redundant letters given by Properties 1 and 4 above.
Similarly the reduced code $w_2$ is obtained by deleting from $w'_2$ the first letter
(that is a letter $h$) and the letter following the $(p+1)$th letter $h$ (that is a letter
$v$). Let $w$ be the word on $A$ obtained by stacking $w_1$ and $w_2$. Then again all
prefixes of $w$ contain at least as many letters $\binom{v}{h}$ as letters $\binom{h}{v}$. It turns out that
this condition is sufficient for $w$ to code a unimodal polygon: this is expressed
by the following lemma, the proof of which is left to the reader.


LEMMA 9.2.3. A word $w$ on $A$ is the stacked reduced code of a unimodal
polygon with dimension $(p+1,q+1)$ if and only if all its prefixes contain at
least as many letters $\binom{v}{h}$ as letters $\binom{h}{v}$, and, viewed as a pair of words on $\{h,v\}$,
it contains $2p$ letters $h$ and $2q$ letters $v$.


In terms of the morphism $\delta$ of the previous section, Lemma 9.2.3 implies
that a word of $A^*$ is the code of a unimodal polygon if and only if it is a prefix
of a Motzkin word on $(A,\delta)$. These prefixes are similar to prefixes of Dyck words
with $\delta$ even, and the proof of Proposition 9.1.2 suggests the following algorithm.
Figure 9.12. A convex polyomino and its contour.


UNIMODAL(p, q)
1  w_1 ← RANDPERM(h^p v^q)                  ▷ generate w = (w_1 over w_2) with δ(w) = 0
2  w_2 ← RANDPERM(h^p v^q)
3  δ ← 0
4  δ_m ← 0
5  for i ← 1 to p + q do
6      if (w_1[i], w_2[i]) = (v, h) then
7          δ ← δ + 1
8      elseif (w_1[i], w_2[i]) = (h, v) then
9          δ ← δ − 1
10         if δ < δ_m then                  ▷ leftmost minimum found
11             (δ_m, w_1[i], w_2[i]) ← (δ, v, h)   ▷ down step to up step
12 return (w_1, w_2)


PROPOSITION 9.2.4. UNIMODAL(p, q) produces the reduced code of a random
uniform unimodal polygon with dimension (p + 1, q + 1) in linear time.

Proof. Lines 1, 2 of the algorithm construct a word $w = \binom{w_1}{w_2}$ satisfying δ(w) = 0.
A straightforward adaptation of the bijection used for Proposition 9.1.2 shows
that these words are in one-to-one correspondence with prefixes of Motzkin
words: for the current δ, steps $\binom{v}{h}$ play the role of up steps, steps $\binom{h}{v}$ that of down
steps, and Motzkin factors replace Dyck factors. The algorithm implements the
inverse bijection, replacing leftmost down steps at negative levels by up steps.
Since the word $\binom{w_1}{w_2}$ is taken uniformly in the set of words with p letters h
and q letters v in both lines, its image is uniform in the set of bicolored Motzkin
prefixes with 2p letters h and 2q letters v. □


As a corollary of the previous proof, we also see that the number of unimodal
polygons of dimension (p + 1, q + 1) is $\binom{p+q}{p}^2$.
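The algorithm UNIMODAL can be sketched in Python as follows. This is only an illustration under stated assumptions: words are plain strings over "h" and "v", the names `flip_leftmost_minima`, `unimodal`, `words`, `is_prefix_code` and `check_bijection` are hypothetical, and `random.shuffle` stands in for RANDPERM. The helper `check_bijection` verifies by brute force, for small p and q, that the flipping step sends the $\binom{p+q}{p}^2$ pairs (w1, w2) to distinct words whose prefixes all satisfy the condition of Lemma 9.2.3.

```python
import random
from itertools import combinations
from math import comb

def flip_leftmost_minima(w1, w2):
    """Lines 3-11 of UNIMODAL: each down step (h over v) that reaches a new
    minimum of delta is turned into an up step (v over h)."""
    w1, w2 = list(w1), list(w2)
    delta = delta_min = 0
    for i in range(len(w1)):
        if (w1[i], w2[i]) == ("v", "h"):        # up step
            delta += 1
        elif (w1[i], w2[i]) == ("h", "v"):      # down step
            delta -= 1
            if delta < delta_min:               # leftmost minimum found
                delta_min = delta
                w1[i], w2[i] = "v", "h"         # down step becomes up step
    return "".join(w1), "".join(w2)

def unimodal(p, q, rng=random):
    """Random pair (w1, w2); rng.shuffle plays the role of RANDPERM."""
    w1 = ["h"] * p + ["v"] * q
    w2 = w1[:]
    rng.shuffle(w1)
    rng.shuffle(w2)
    return flip_leftmost_minima(w1, w2)

def words(p, q):
    """All words with p letters h and q letters v."""
    n = p + q
    for positions in combinations(range(n), p):
        chosen = set(positions)
        yield "".join("h" if i in chosen else "v" for i in range(n))

def is_prefix_code(w1, w2):
    """True if every prefix has at least as many (v over h) as (h over v)."""
    delta = 0
    for a, b in zip(w1, w2):
        delta += ((a, b) == ("v", "h")) - ((a, b) == ("h", "v"))
        if delta < 0:
            return False
    return True

def check_bijection(p, q):
    """Brute-force check that the flipping step is injective and valid."""
    images = {flip_leftmost_minima(a, b)
              for a in words(p, q) for b in words(p, q)}
    return (len(images) == comb(p + q, p) ** 2
            and all(is_prefix_code(a, b) for a, b in images))
```

For instance `unimodal(3, 2)` returns a pair of words of length 5 containing together 6 letters h and 4 letters v.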


9.2.4. Convex polyominoes and rejection sampling 


The contour of a convex polyomino with dimension (p + 1,q+ 1) can be coded 
as follows by a pair (w’,k): start from the upper point of the contour on the left 
boundary, and code the path in clockwise direction by a word w’ with letters h 
and v as previously; let moreover k be the distance of the start point to the top
border of the bounding box (see Figure 9.12). From the geometry, the following
properties of the word w’ are immediate: 

1. there are 2p + 2 letters h and 2q + 2 letters v; moreover 0 ≤ k ≤ q;

2. the first p + 1 letters h code right steps, the other p + 1 left steps;

3. the first k letters v code up steps, the next q + 1 down steps, and the final
q + 1 − k up steps again;

4. the first letter is a letter h;

5. if k > 0 then the kth letter v is followed by a letter h;

6. the (p + 1)th letter h is followed by a letter v;

7. the (k + q + 1)th letter v is followed by a letter h;

8. the (2p + 2)th letter h is followed by a letter v;

9. the letters singled out in 4, 5, 6, 7, and 8 above appear in this order.
These properties do not completely characterize the codes of convex polygons, 
but this is almost the case, as the reader will verify: 


LEMMA 9.2.5. A pair (w', k) satisfying the nine properties above is the code
of a convex polygon if and only if the corresponding walk is a polygon, that is,
if it does not visit the same point twice. This property can be checked in linear
time by the following algorithm.


CHECKSIMPLE(w', k)
 1  (i1, δ1, ε1) ← (1, q + 1 − k, +1)           ▷ traversal of w' from the left
 2  (i2, δ2, ε2) ← (2p + 2q + 3, q − k, −1)     ▷ traversal of w' from the right
 3  for ℓ ← 1 to p + 1 do                        ▷ ℓ counts horizontal steps
 4      while w'[i1] = v do                      ▷ vertical move on top
 5          (i1, δ1) ← (i1 + 1, δ1 + ε1)
 6      while w'[i2] = v do                      ▷ vertical move on bottom
 7          (i2, δ2) ← (i2 − 1, δ2 + ε2)
 8      if δ1 ≤ δ2 then                          ▷ self-intersection detected
 9          return FALSE
10      if δ1 = q + 1 then                       ▷ top reached
11          ε1 ← −1
12      if δ2 = 0 then                           ▷ bottom reached
13          ε2 ← +1
14      (i1, i2) ← (i1 + 1, i2 − 1)              ▷ next column
15  return TRUE
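The polygon condition of Lemma 9.2.5 can also be tested directly by decoding the walk and recording the points it visits. The sketch below is not the two-pointer traversal of CHECKSIMPLE: it swaps in a hash set of visited points, which is simpler to state. The function names and the string representation of w' are assumptions made for the illustration; step directions follow Properties 2 and 3.

```python
def decode_walk(w, p, q, k):
    """Lattice points visited by the walk coded by (w, k), start included."""
    x, y = 0, q + 1 - k              # start on the left side, k below the top
    points = [(x, y)]
    seen_h = seen_v = 0
    for letter in w:
        if letter == "h":
            seen_h += 1
            # Property 2: first p + 1 letters h go right, the others go left
            x += 1 if seen_h <= p + 1 else -1
        else:
            seen_v += 1
            # Property 3: k up steps, then q + 1 down steps, then up steps again
            going_up = seen_v <= k or seen_v > k + q + 1
            y += 1 if going_up else -1
        points.append((x, y))
    return points

def is_simple_polygon(w, p, q, k):
    """True if the walk is closed and visits no point twice."""
    points = decode_walk(w, p, q, k)
    if points[0] != points[-1]:
        return False
    return len(set(points[:-1])) == len(points) - 1
```

With p = q = 1 and k = 0, the word hhvvhhvv codes the 2 × 2 square, while hvhvhvhv passes twice through the point (1, 1) and is rejected.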


The reduced code (w, k) of a convex polygon is obtained by deleting the
redundant letters given by Properties 4, 6, 7, 8, and if k > 0 by Property 5.
The reduced word w thus has 2p letters h and 2q letters v if k = 0, or
2p − 1 letters h and 2q letters v if k > 0. Given the reduced word w and the index k, there
is an immediate algorithm INSERTREDUNDANTLETTERS(w, k) that reconstructs
w' by inserting the missing letters from left to right.

The following generator is based on the rejection principle: words of a superset
of the set of codes are generated uniformly at random until a proper code
is obtained.

Figure 9.13. A directed animal and the equivalent strict pyramid.


CONVEX(p, q)
1  do  k ← RANDU(0, q)
2      w ← RANDPERM(h^{2p} v^{2q})
3      if k = 0 or w[2p + 2q] = h then
4          w' ← INSERTREDUNDANTLETTERS(w, k)
5          if CHECKSIMPLE(w', k) = TRUE then
6              return (w', k)
7  while TRUE


PROPOSITION 9.2.6. CONVEX(p,q) produces the code of a random uniform 
convex polygon with dimension (p+1,q+ 1). 


Proof. The fact that the output is uniform follows from the following standard
rejection argument: when the algorithm stops, the probability of outputting a given
code is proportional to the probability of getting this code as an element of the
superset; but elements of the superset are sampled uniformly, i.e. have the
same probability of being generated. □


The expected complexity of the algorithm CONVEX() depends on the comparison
between the size $(q+1)\binom{2p+2q}{2p}$ of the superset $S_{p,q}$, in which k and w
are sampled, and the size of the set $P_{p,q}$ of convex polygons with dimension
(p + 1, q + 1). More precisely, each loop takes linear time, the probability of
success of a loop is $s_{p,q} = \mathrm{Card}(P_{p,q}) / \mathrm{Card}(S_{p,q})$, and the number of loops is a
geometric random variable with expectation $1/s_{p,q}$. The explicit computation
of $\mathrm{Card}(P_{p,q})$ shows that this last value is bounded by a constant, but we do
not include the details here (see Problem 9.2.2).


PROPOSITION 9.2.7. The call CONVEX(p, q) has expected linear complexity. 


9.2.5. Directed animals 

Upon rotating the lattice counterclockwise by π/4, directed animals can be
given an elegant interpretation in terms of heaps of bricks: cells are viewed
as bricks exposed to the gravity law with the bottom brick lying on the floor;
the condition that animals are directed, i.e. that there always exists a path
downward to the bottom cell, is equivalent to the fact that every brick leans on
one brick below and cannot fall.

To be more precise, let us give a definition of heaps of bricks. The alphabet
of bricks is B = {(i, i + 1), i ∈ Z}. Two bricks b, b' of B commute if and only
if, as subsets of Z, b ∩ b' = ∅. Two words are equivalent, w ≡ w', if one can be
obtained from the other by a sequence of commutations of adjacent commuting
bricks. A heap of bricks is an element of the associated partially commutative
monoid, i.e. an equivalence class for the relation ≡. The set of minimal bricks
of a heap w is the set min(w) = {b | ∃w', w ≡ bw'}. A pyramid at abscissa i is
a heap such that min(w) = {(i, i + 1)}.
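The definitions of commutation, minimal bricks and pyramids can be made concrete with a few lines of Python. In this sketch, which rests on assumptions of its own, a brick (i, i + 1) is represented by the integer i and a heap by any word (list of integers) of its equivalence class; the function names are hypothetical. A brick is minimal exactly when its first occurrence commutes with every letter written before it.

```python
def commute(b, c):
    """Bricks (b, b+1) and (c, c+1) commute iff they are disjoint subsets of Z."""
    return abs(b - c) >= 2

def minimal_bricks(word):
    """Set min(w): bricks that can be moved to the front of the word."""
    minimal, earlier = set(), []
    for b in word:
        # b reaches the front iff it commutes with every earlier letter;
        # a brick never commutes with itself, so only first occurrences qualify
        if all(commute(b, c) for c in earlier):
            minimal.add(b)
        earlier.append(b)
    return minimal

def is_pyramid(word, abscissa=None):
    """True if the heap has a single minimal brick (at `abscissa` if given)."""
    minimal = minimal_bricks(word)
    if len(minimal) != 1:
        return False
    return abscissa is None or minimal == {abscissa}
```

For example the word `[0, 2, 1]` has two minimal bricks, 0 and 2, whereas `[0, 1, -1]` is a pyramid at abscissa 0.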

The canonical geometric representation of a heap induced by the gravity law
corresponds to the standard Cartier-Foata normal form of the heap: reading
a heap from left to right in lines from bottom to top yields a word w of the
form $w_1 \cdots w_r$ with each block $w_j$ made of commuting letters and such that for
each letter b of $w_{j+1}$ there is a letter b' of $w_j$ with b ∩ b' ≠ ∅. A heap is strict if
moreover no two consecutive blocks of the normal form have a brick in common:
in other terms, in a strict heap a brick (i, i + 1) always leans on a brick (i − 1, i)
or (i + 1, i + 2), not on another brick (i, i + 1).
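The normal form described above can be computed by letting the bricks fall one by one. The following Python sketch assumes that a brick (i, i + 1) is coded by the integer i and that a heap is given by any of its words; `cartier_foata` and `is_strict` are hypothetical names. Each brick lands one level above the highest brick it overlaps, which yields the blocks from bottom to top.

```python
def cartier_foata(word):
    """Blocks of the normal form, from bottom to top, each sorted by abscissa."""
    top = {}       # top[i] = highest level holding a brick with left abscissa i
    blocks = []
    for b in word:
        # the brick b overlaps exactly the bricks b - 1, b and b + 1
        level = 1 + max(top.get(b - 1, -1), top.get(b, -1), top.get(b + 1, -1))
        top[b] = level
        if level == len(blocks):
            blocks.append([])
        blocks[level].append(b)
    return [sorted(block) for block in blocks]

def is_strict(word):
    """True if no two consecutive blocks share a brick."""
    blocks = cartier_foata(word)
    return all(not set(lower) & set(upper)
               for lower, upper in zip(blocks, blocks[1:]))
```

Equivalent words give the same blocks: both `[0, 2, 1, 1]` and `[2, 0, 1, 1]` yield `[[0, 2], [1], [1]]`, a heap that is not strict because brick 1 sits directly on another brick 1.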

From the geometric interpretation of pyramids of bricks and the initial dis- 
cussion of this paragraph, the following lemma is immediate. 


LEMMA 9.2.8. Directed animals “are” strict pyramids of bricks. 


This interpretation of directed animals in terms of pyramids of bricks allows us to
perform decompositions that would otherwise be very difficult to explain. First
define a semi-pyramid to be a pyramid without bricks on the left hand side
of the bottom brick. Then the following two decompositions are obtained by
pushing upward a brick and all the bricks that lie above it, or indirectly lean
on it:

— a strict pyramid of bricks is either a strict semi-pyramid, or can be fac- 
tored, by pushing upward the lowest brick with abscissa —1, into a strict 
pyramid at abscissa —1 stacked over a strict semi-pyramid; 

— a strict semi-pyramid is reduced to a brick, or to a strict semi-pyramid 
at abscissa 1 over a brick, or can be factored, by pushing upward the 
second lowest brick with abscissa 0, into a strict semi-pyramid at abscissa 
0 stacked over a strict semi-pyramid at abscissa 1 over a brick. 

This joint decomposition is isomorphic to the joint decomposition of prefixes of
words and of words of the Motzkin language on the alphabet {a, b, x1}:

— a prefix of a Motzkin word is either a Motzkin word or can be decomposed
as uav with u a Motzkin word and v a prefix of a Motzkin word.

— a Motzkin word is reduced to the empty word ε, or is of the form x1 u
with u a Motzkin word, or can be decomposed as aubv with u and v two
Motzkin words.

These isomorphic decompositions induce a bijection between strict pyramids of
n bricks and prefixes of Motzkin words of length n − 1.

Figure 9.14. Decomposition of pyramids.


COROLLARY 9.2.9. Prefixes of Motzkin words can be bijectively transformed 
into strict pyramids of bricks in linear time. 
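By this bijection, counting strict pyramids of n bricks amounts to counting prefixes of Motzkin words of length n − 1. A minimal Python sketch of that count is given below; it assumes the letter a is an up step, b a down step and the letters x1, ..., xk level steps, and the function name is hypothetical. It keeps, for each height, the number of prefixes ending at that height.

```python
def count_motzkin_prefixes(length, k=1):
    """Number of prefixes of k-colored Motzkin words of the given length."""
    by_height = {0: 1}                 # by_height[h] = prefixes ending at height h
    for _ in range(length):
        nxt = {}
        for h, count in by_height.items():
            nxt[h + 1] = nxt.get(h + 1, 0) + count       # up step
            nxt[h] = nxt.get(h, 0) + k * count           # one of k level steps
            if h > 0:
                nxt[h - 1] = nxt.get(h - 1, 0) + count   # down step
        by_height = nxt
    return sum(by_height.values())
```

For k = 1 the lengths 0, 1, 2, 3, 4 give 1, 2, 5, 13, 35 prefixes, which by the bijection are the numbers of strict pyramids with 1 to 5 bricks.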


The Motzkin language being algebraic, uniform random generation could be
done using a recursive approach. We describe instead another application of
the rejection principle which is both more elegant and more efficient for this
specific problem. Let us consider again the alphabet $A_k = \{u, d, x_1, \ldots, x_k\}$
and the associated k-colored Motzkin words of Section 9.1.3. A naive algorithm
to generate uniform random prefixes of k-colored Motzkin words of length n
consists in generating uniform random words of $(A_k)^n$ and rejecting. However
a simple calculation shows that the probability of success is of order $O(n^{-1/2})$,
thus giving an algorithm with expected complexity $O(n^{3/2})$. A slight refinement
on this idea is to observe that rejection can be decided on the fly. This turns
out to be surprisingly efficient.


FLORENTINEREJECTION(n, k)
 1  do  (w, δ) ← (ε, 0)
 2      for i ← 1 to n do              ▷ generate from left to right
 3          w[i] ← RAND(1, k + 2)
 4          if w[i] = k + 1 then
 5              δ ← δ + 1
 6              w[i] ← u
 7          elseif w[i] = k + 2 then
 8              δ ← δ − 1
 9              w[i] ← d
10          if δ < 0 then              ▷ if a negative prefix is detected
11              break                  ▷ restart from scratch
12  while i ≠ n + 1                    ▷ until w is a valid n letters word
13  return w


This algorithm obviously produces a prefix of a k-colored Motzkin word.

LEMMA 9.2.10. The function FLORENTINEREJECTION(n, k) generates a random
uniform prefix of a k-colored Motzkin word of length n in expected linear time.


Proof. For simplicity the analysis is presented in the case k = 0 but the same
strategy of analysis applies to the general case (using generating functions
instead of elementary counting). It will be convenient to consider that when the
construction fails at the ith step of the inner loop, we finish the loop and
generate n − i more letters at no cost. This modification of the algorithm does not affect
the final result or the cost, but allows us to think of each iteration as producing
a uniform random word of $(A_k)^n$. From this point of view, the Florentine
rejection behaves like standard rejection and therefore it is uniform on prefixes.

The probability of success of the inner loop is $p_n = \binom{n}{\lfloor n/2 \rfloor} 2^{-n} = \Theta(n^{-1/2})$, and the
number of aborted loops is a geometric random variable with expected value
$1/p_n = O(n^{1/2})$. Let us now compute the expected cost of a failure: a failure
with cost 2i + 1 is obtained for a word w of the form ubv with u a Dyck word of
length 2i and v in $\{a, b\}^{n-2i-1}$. Hence the cumulated cost for all these $2^n - \binom{n}{\lfloor n/2 \rfloor}$
words is

$$\sum_{i \ge 0} (2i+1)\, C_i\, 2^{n-2i-1} = 2^{n-1} \sum_{i \ge 0} (2i+1)\, C_i\, 4^{-i} = O(2^n n^{1/2}),$$

where $C_i$ is the ith Catalan number and both sums run over 0 ≤ i ≤ (n − 1)/2. With
$O(n^{1/2})$ aborted loops with cost $O(n^{1/2})$ each, and one successful loop with cost
n, the total expected cost is linear as announced. □
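A direct Python transcription of FLORENTINEREJECTION may help to see how rejection is decided on the fly. It is a sketch under stated assumptions: words are lists of strings, the letters x1, ..., xk are written "x1", ..., "xk", `random.randint` plays the role of RAND, and the helper `is_motzkin_prefix` is added only to check the output.

```python
import random

def florentine_rejection(n, k, rng=random):
    """Uniform random prefix of length n of a k-colored Motzkin word."""
    while True:                              # the do ... while loop
        word, delta = [], 0
        for _ in range(n):                   # generate from left to right
            r = rng.randint(1, k + 2)        # uniform in {1, ..., k + 2}
            if r == k + 1:
                delta += 1
                word.append("u")
            elif r == k + 2:
                delta -= 1
                word.append("d")
            else:
                word.append(f"x{r}")
            if delta < 0:                    # negative prefix detected
                break                        # restart from scratch
        else:
            return word                      # all n letters were accepted

def is_motzkin_prefix(word):
    """True if no prefix contains more letters d than letters u."""
    delta = 0
    for letter in word:
        delta += (letter == "u") - (letter == "d")
        if delta < 0:
            return False
    return True
```

A call such as `florentine_rejection(20, 1)` returns a list of 20 letters among u, d and x1 whose prefixes never go below level 0.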


Florentine rejection thus uses on average a linear number of random bits.
As opposed to this, a call to RANDPERM(w) for a word w of length n uses about
n log n bits, and this is in general suboptimal from a theoretical point of view.
For instance for $w = a^n b^n$, $\log_2 \binom{2n}{n} \sim 2n$ bits should suffice. In this case an
optimal solution (on average) is obtained using FLORENTINEREJECTION(n, 0)
to get a prefix of Dyck words and Catalan's factorization (Proposition 9.1.2) to
transform it into a word of $\mathfrak{S}(a^n b^n)$. As opposed to this, it is an open problem
in general to sample in linear time from $\mathfrak{S}(w)$ using $O(\log \mathrm{Card}(\mathfrak{S}(w)))$ random
bits.
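The gap between the two bit counts is easy to evaluate numerically. The short Python sketch below compares, for the word $a^n b^n$ of length 2n, the rough cost m log2 m of a permutation-based generator on m = 2n letters with the entropy bound $\log_2 \binom{2n}{n}$; the function names are hypothetical and the first quantity is only the order of magnitude quoted in the text, not an exact count.

```python
from math import comb, log2

def randperm_bits(m):
    """Rough number of random bits used to shuffle a word of length m."""
    return m * log2(m)

def entropy_bits(n):
    """Bits needed to single out one arrangement of n letters a and n letters b."""
    return log2(comb(2 * n, n))

def compare(n):
    """Return the pair (shuffle cost, entropy bound) for the word a^n b^n."""
    return randperm_bits(2 * n), entropy_bits(n)
```

For n = 100 the entropy bound stays below 2n = 200 bits, while the shuffle estimate exceeds 1500 bits.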


9.3. Coding: trees and maps 


A planar map¹ is a proper embedding of a connected graph in the plane. Multiple
edges and loops are allowed, and proper means that edges are smooth
simple arcs which meet only at their endpoints. The faces of a planar map are 
the connected components of the complement of the graph in the plane: apart 
from one infinite face, all faces are bounded and homeomorphic to disks. All 
the planar maps we consider are rooted: they have an oriented edge, called the 
root, which is incident to the infinite face on its right-hand side. Examples of 
rooted maps are presented in Figure 9.15. 

From now on we shall consider that two planar maps are the same if one
can be mapped onto the other (including roots) by a homeomorphism of the
plane. However there are still many more planar maps than planar graphs, as
illustrated by Figure 9.15. Indeed homeomorphisms of the plane respect the
neighborhood of each vertex, so that the circular order of edges around vertices
is fixed.

¹ The word map is intended here in its geographic sense, as in road map.

Figure 9.15. Two rooted planar maps with the same underlying graph.

From a combinatorial point of view, a planar map can in fact entirely be 
specified as follows: label half-edges (or darts) and for each half-edge give the 
names of the opposite half-edge, and of the next half-edge around its origin in 
counterclockwise direction. As a consequence the number of planar maps with 
n edges is finite. Moreover these labeled maps capture exactly the level at which 
algorithms on maps are implemented in computational geometry, using darts 
as elementary data structures. Carrying on with labeled maps, one could also 
reach a purely combinatorial setting and eliminate the geometry (at least at 
the formal level of proofs). However for the sake of conciseness it appears more 
efficient to keep higher level geometric arguments. 
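The dart description can be turned into a small data structure. In the Python sketch below, which is an illustration under assumptions of its own, darts are integers, `opposite` maps each dart to the other half of its edge and `next_ccw` maps each dart to the next dart around its origin in counterclockwise direction. Vertices are the cycles of `next_ccw`, edges the cycles of `opposite`, and faces are recovered as the cycles of the composed permutation dart → `next_ccw[opposite[dart]]`, a standard convention assumed here rather than taken from the text. For a connected planar map, Euler's relation gives vertices − edges + faces = 2.

```python
def cycles(perm):
    """Cycles of a permutation given as a dict."""
    seen, result = set(), []
    for start in perm:
        if start in seen:
            continue
        cycle, d = [], start
        while d not in seen:
            seen.add(d)
            cycle.append(d)
            d = perm[d]
        result.append(cycle)
    return result

def euler_characteristic(opposite, next_ccw):
    """Vertices - edges + faces of the map coded by the two permutations."""
    vertices = cycles(next_ccw)
    edges = cycles(opposite)
    faces = cycles({d: next_ccw[opposite[d]] for d in opposite})
    return len(vertices) - len(edges) + len(faces)

# A triangle: darts 0/1 on edge AB, 2/3 on edge BC, 4/5 on edge CA.
TRIANGLE_OPPOSITE = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}
TRIANGLE_NEXT = {0: 5, 5: 0, 1: 2, 2: 1, 3: 4, 4: 3}
```

Changing only `next_ccw` while keeping `opposite` gives a different map on the same graph, which is the phenomenon illustrated by Figure 9.15.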

Examples of specific families of planar maps are numerous. A triangulation 
of a k-gon is a planar map without multiple edges such that all bounded faces 
have degree 3 and the infinite face has degree k (the degree of a face is the 
number of sides of edges to which it is incident). A k-valent map is a planar 
map such that all vertices have degree k (the degree of a vertex is the number 
of half-edges to which it is incident). 


9.3.1. Plane trees and generalities on coding 


A rooted plane tree, or hereafter simply a plane tree is a planar map with one 
face. A planted plane tree is a plane tree such that the root vertex has degree 1. 
A binary tree is a planted plane tree with vertices of degree 3 and 1 only, respec- 
tively called nodes and leaves. These definitions agree with classical recursive 
definitions of plane trees: for instance a plane tree can be decomposed as an 
ordered sequence of subtrees attached to the root. 

The contour traversal of a planar map is the walk on the vertices and edges 
of the map that starts from (the right-hand side of) the root edge, and turns 
around the map in counterclockwise direction so as to visit the boundary of 
the infinite face. (The reader is encouraged to imagine an ant walking around 
the map.) The contour traversal of a plane tree visits in particular twice every 
edge: the first time away from the root vertex, and the second time toward the 
root vertex. The preorder on the vertices of a planted plane tree is defined by 
ordering vertices according to the first passage of the contour traversal. 

The Dyck code of a planted plane tree with n + 1 edges is the word of length
2n on the alphabet {u, d} obtained during a contour traversal of the tree by
writing a letter u each time a non-root edge is visited for the first time (away
from the root vertex), and a letter d each time a non-root edge is visited for the
second time (toward the root vertex). The reader should check that
the Dyck code of a tree characterizes it.

Figure 9.16. A planted plane tree and its Dyck code.

Figure 9.17. A planted binary tree and its prefix code.


LEMMA 9.3.1. Dyck encoding is a bijection between planted plane trees with 
n+ 1 edges and Dyck words of length 2n. In particular the number of planted 
plane trees with n+ 1 edges is the nth Catalan number. 
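Dyck encoding and its inverse are short recursive and stack-based procedures. The sketch below assumes a representation chosen for the illustration: the part of the planted tree hanging below the root edge is a nested Python list, each vertex being the list of its children in contour order. The function names are hypothetical.

```python
def dyck_encode(tree):
    """Dyck code of the tree: u when going down an edge, d when coming back."""
    return "".join("u" + dyck_encode(child) + "d" for child in tree)

def dyck_decode(word):
    """Inverse of dyck_encode, using a stack of currently open vertices."""
    root = []
    stack = [root]
    for letter in word:
        if letter == "u":          # first visit of an edge: create a child
            child = []
            stack[-1].append(child)
            stack.append(child)
        else:                      # second visit: go back to the parent
            stack.pop()
    return root
```

For instance the tree `[[], [[], []]]`, with 4 non-root edges, has Dyck code `uduududd` of length 8.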


The prefix or Łukasiewicz code of a planted plane tree with n edges is the
word of length n on the alphabet {$x_i$, i ≥ 0} obtained during a contour traversal
of the tree by writing a letter $x_i$ each time a non-root vertex with degree i + 1 is
visited for the first time. Let us define the morphism δ by δ($x_i$) = i − 1. Then
the prefix code w of a planted plane tree has the Łukasiewicz property (i.e. for
each strict prefix v of w, δ(v) > δ(w)). In particular, upon setting x2 = u and
x0 = d, we obtain the following lemma for the case of binary trees:


LEMMA 9.3.2. Prefix encoding is a bijection between binary trees with n nodes
(and thus n + 2 leaves and 2n + 1 edges) and words of length 2n + 1 of the
Dyck-Łukasiewicz language $\mathcal{D}d$. In particular the number of binary trees with
n nodes is the nth Catalan number.
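For binary trees, prefix encoding reads a node as u and a leaf as d. The Python sketch below assumes a representation picked for the illustration: the tree hanging below the root edge is `None` for a leaf and a pair `(left, right)` for a node. The function names are hypothetical; `has_lukasiewicz_property` checks the defining condition with δ(u) = +1 and δ(d) = −1.

```python
def prefix_encode(tree):
    """Prefix code: u for a node followed by its two subtrees, d for a leaf."""
    if tree is None:
        return "d"
    left, right = tree
    return "u" + prefix_encode(left) + prefix_encode(right)

def prefix_decode(word):
    """Inverse of prefix_encode; raises ValueError on trailing letters."""
    def parse(i):
        if word[i] == "d":
            return None, i + 1
        left, i = parse(i + 1)
        right, i = parse(i)
        return (left, right), i
    tree, end = parse(0)
    if end != len(word):
        raise ValueError("not the prefix code of a single tree")
    return tree

def has_lukasiewicz_property(word):
    """True if every strict prefix v satisfies delta(v) > delta(word)."""
    total = word.count("u") - word.count("d")
    partial = 0
    for letter in word:
        if partial <= total:       # partial is delta of a strict prefix
            return False
        partial += 1 if letter == "u" else -1
    return True
```

A tree with n = 2 nodes such as `((None, None), None)` is coded by `uuddd`, a word of length 2n + 1 = 5.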


Recall that the optimal coding problem for a family $\mathcal{C}$ of combinatorial structures
consists in finding a function γ that maps injectively objects of $\mathcal{C}$ on words
of {0, 1}* in such a way that an object O of size n is coded by a word γ(O) of
length roughly bounded by $\log_2 \mathrm{Card}(\mathcal{C}_n)$, with $\mathcal{C}_n$ the set of objects of size n.
Since the nth Catalan number $C_n$ satisfies $\log_2 C_n \sim 2n$ as n goes to infinity, Dyck
codes and prefix codes respectively solve the optimal coding problem for plane
trees and for binary trees. On the other hand, the Dyck code of a binary tree
with n nodes has length 4n + 2, so that Dyck codes are far from optimality with
respect to the family of binary trees: the optimality of a code is relative to the
entropy $\log_2 \mathrm{Card}(\mathcal{C}_n)$ of the set $\mathcal{C}_n$ under consideration.

More generally, consider the set of planted plane trees with $d_i$ nodes of
degree i (and thus $\ell = 1 + \sum_i (i-2) d_i$ non-root leaves). Prefix encoding defines a
bijection between this set of trees and the subset of words of $\mathfrak{S}(x_0^{\ell} x_1^{d_2} \cdots x_k^{d_{k+1}})$ that
have the Łukasiewicz property. But according to the cycle lemma, the fraction
of such words of length n among words of same length in $\mathfrak{S}(x_0^{\ell} x_1^{d_2} \cdots x_k^{d_{k+1}})$ is 1/n.
Now words on a finite alphabet with fixed proportion of letters can be encoded
optimally by the so-called entropy coder. Hence prefix encoding combined with
entropy encoding yields optimal coding for plane trees with a fixed proportion
of nodes of each degree.
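The 1/n fraction given by the cycle lemma can be observed on a small example. In the Python sketch below a letter $x_i$ is represented by the integer i, so that δ is the sum of the values i − 1; the function names are hypothetical. For a word with δ = −1, exactly one of its n conjugates has the Łukasiewicz property.

```python
from itertools import permutations

def has_lukasiewicz_property(word):
    """word is a list of integers i standing for the letters x_i."""
    total = sum(i - 1 for i in word)
    partial = 0
    for i in word:
        if partial <= total:       # partial is delta of a strict prefix
            return False
        partial += i - 1
    return True

def lukasiewicz_conjugates(word):
    """Start positions s such that the conjugate word[s:] + word[:s] qualifies."""
    return [s for s in range(len(word))
            if has_lukasiewicz_property(word[s:] + word[:s])]

def count_lukasiewicz(letters):
    """Number of distinct arrangements of `letters` with the property."""
    return sum(has_lukasiewicz_property(list(w))
               for w in set(permutations(letters)))
```

With the letters x0, x0, x1, x2 there are 12 distinct words of length n = 4, of which 12/4 = 3 have the Łukasiewicz property.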


9.3.2. Conjugacy classes of trees 


From now on, we consider planted plane trees with two types of vertices of 
degree 1, respectively called buds and leaves. Vertices of higher degree are 
called nodes. In particular, a blossoming tree is a planted plane tree such that 
each node has degree 4 and is adjacent to exactly one bud; a blossoming tree 
with n nodes has thus n + 2 leaves and n buds. Examples of blossoming trees 
are given in Figure 9.18. 


LEMMA 9.3.3. The number of blossoming trees that are planted on a leaf and
have n nodes is $\frac{3^n}{n+1}\binom{2n}{n}$. The number of blossoming trees that are planted on
a bud and have n nodes is $\frac{3^n}{n+2}\binom{2n}{n-1}$.


Proof. Let $\mathcal{B}'_n$ and $\mathcal{B}''_n$ denote these two sets of blossoming trees. A blossoming
tree of the first type can be uniquely obtained from a binary tree with n nodes
by attaching a bud to each node in one of the three possible ways. Together
with Lemma 9.3.2, this proves the first formula.

Now let us consider the set of doubly planted blossoming trees, one root being
a leaf and the second one a bud. Such a tree with n nodes can be considered
either as a blossoming tree in $\mathcal{B}'_n$ with a marked bud, or as a blossoming tree
in $\mathcal{B}''_n$ with a marked leaf. Hence doubly planted blossoming trees with n nodes
are either counted by $n \,\mathrm{Card}(\mathcal{B}'_n)$ or by $(n+2)\,\mathrm{Card}(\mathcal{B}''_n)$. As a consequence,
$\mathrm{Card}(\mathcal{B}''_n) = \frac{n}{n+2} \cdot \frac{3^n}{n+1}\binom{2n}{n} = \frac{3^n}{n+2}\binom{2n}{n-1}$, which proves the second formula. □


Let T be a planted plane tree with n nodes. During a contour traversal of
T, its buds and leaves are visited in a sequence (by convention the root vertex is
visited at the end of the traversal). Accordingly the border word is the word with
letters {b, ℓ} obtained along the contour traversal by writing a letter b each time
a bud is visited and a letter ℓ each time a leaf is visited. For example, the border
words of the blossoming trees of Figure 9.18 are respectively ℓℓbℓbℓℓbbℓbℓℓbℓbbℓ
and bℓbℓℓbbℓbℓℓbℓbbℓℓℓ.

Figure 9.18. Two conjugate blossoming trees: (a) a blossoming tree, (b) and a balanced one.

Two planted plane trees T and T' are conjugate if one is obtained from the
other by re-rooting. In other terms, two planted plane trees are in the same
conjugacy class of trees if they share the same underlying unrooted plane tree.
This terminology is motivated by the remark that conjugate planted plane trees
have conjugate border words. Taking δ(b) = +1 and δ(ℓ) = −1, the cycle lemma
suggests the following definition: a planted plane tree is balanced if its border
word has the Łukasiewicz property. With this definition, and remembering that
blossoming trees have two more leaves than buds, the cycle lemma for those
trees reads: a blossoming tree has exactly two canonical leaves such that the
conjugate trees rooted at these leaves are balanced.
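The two canonical leaves can be located directly on the border word. The Python sketch below writes the letter ℓ as "l" and uses hypothetical function names; `is_balanced` tests the Łukasiewicz property with δ(b) = +1 and δ(ℓ) = −1, and `balanced_rotations` lists the conjugates that are balanced. The two words are the border words of Figure 9.18 quoted in the text.

```python
def is_balanced(border):
    """Lukasiewicz property of a border word over the letters b and l."""
    total = border.count("b") - border.count("l")
    partial = 0
    for letter in border:
        if partial <= total:       # partial is delta of a strict prefix
            return False
        partial += 1 if letter == "b" else -1
    return True

def balanced_rotations(border):
    """Start positions s such that border[s:] + border[:s] is balanced."""
    return [s for s in range(len(border))
            if is_balanced(border[s:] + border[:s])]

FIGURE_A = "llblbllbblbllblbbl"   # border word of the tree of Figure 9.18(a)
FIGURE_B = "blbllbblbllblbblll"   # border word of the balanced tree (b)
```

Each word has 8 letters b and 10 letters l, and each admits exactly two balanced conjugates, in agreement with the cycle lemma.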


LEMMA 9.3.4. There are $\frac{2}{n+2} \cdot \frac{3^n}{n+1}\binom{2n}{n}$ balanced blossoming trees with n nodes.


Proof. The first proof is again based on a double counting argument. Let $\mathcal{B}^*_n$
be the set of balanced blossoming trees with n nodes. The number of balanced
blossoming trees with a secondary root leaf is $(n+2)\,\mathrm{Card}(\mathcal{B}^*_n)$. Upon exchanging
the role of the two roots, these trees are also blossoming trees with a secondary
root leaf taken among the two canonical leaves: their number is thus $2 \cdot \frac{3^n}{n+1}\binom{2n}{n}$.
The result follows. □


Proof (bis). An alternative proof is based on the following remark: the number
of balanced re-rootings of any blossoming tree is equal to the difference between
its numbers of leaves and buds, so that, in each conjugacy class of trees, the
number of balanced trees is exactly the difference between the number of trees
rooted on a leaf and the number of trees rooted on a bud. Hence the number of
balanced blossoming trees with n nodes is the difference $\frac{3^n}{n+1}\binom{2n}{n} - \frac{3^n}{n+2}\binom{2n}{n-1}$. □

Figure 9.19. The closure of the balanced blossoming tree of Figure 9.18(b): (a) a fusion, (b) the partial closure, (c) and the complete closure.
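The counting formulas of Lemmas 9.3.3 and 9.3.4 can be checked against each other numerically. The Python sketch below uses hypothetical function names and exact integer arithmetic; it evaluates the number of blossoming trees planted on a leaf, the number planted on a bud, and the number of balanced ones, so that the difference stated in the second proof can be verified for small n.

```python
from math import comb

def planted_on_leaf(n):
    """3^n / (n + 1) * binom(2n, n): one of 3 bud positions per node."""
    return 3 ** n * comb(2 * n, n) // (n + 1)

def planted_on_bud(n):
    """n / (n + 2) times the previous count."""
    return n * 3 ** n * comb(2 * n, n) // ((n + 1) * (n + 2))

def balanced(n):
    """2 / (n + 2) * 3^n / (n + 1) * binom(2n, n)."""
    return 2 * 3 ** n * comb(2 * n, n) // ((n + 1) * (n + 2))
```

For n = 1, 2, 3, 4 the balanced counts are 2, 9, 54, 378, each being the leaf-planted count minus the bud-planted count.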


9.3.3. The closure of a plane tree 


The closure of a planted plane tree with two more leaves than buds is obtained 
by repeating the following construction until only two leaves remain: perform a 
contour traversal, and each time a leaf follows a bud in the sequence of vertices of 
degree 1 met by the walk, match them, i.e. fuse the two corresponding dangling 
edges in the unique way that creates a bounded face with no vertex of degree 1 
inside (see Figure 9.19(a)). 


LEMMA 9.3.5. The closure of a plane tree with n nodes and two more leaves 
than buds terminates and produces a planar map with the same n nodes and 
two leaves, which are both incident to the infinite face. In particular the closure 
of a blossoming tree has n vertices of degree four, plus two of degree one in the 
infinite face. 

If moreover the tree is balanced, then its root vertex is one of the two re- 
maining leaves. 


Proof. At each iteration all factors bℓ of the border word are detected, and
deleted since the corresponding pairs of bud and leaf are matched. In particular
at least one pair is matched at each iteration, so that the construction terminates.
Vertices of degree at least two remain unchanged while all buds and
leaves are eliminated but the two canonical roots. □


As described above the closure could require a quadratic number of operations.
The following algorithm takes a planted plane tree with two more leaves
than buds and computes its closure in linear time. It uses the following items:

a local stack with functions PUTINSTACK(), POPFROMSTACK() and
ISSTACKEMPTY(),

a function NEXTFREEVERTEX(vertex) that starts a contour traversal after
the vertex of degree 1 vertex and returns the first vertex of degree 1 found,

a function TYPE(vertex) that tells whether vertex is a bud or a leaf,

a function FUSEINTOEDGE(bud, leaf) that realizes the fusion of a bud bud
and a leaf leaf into an edge.


CLOSURE(T)
 1  n ← NUMBEROFLEAVES(T)
 2  vertex ← ROOTOF(T)
 3  (ℓ1, ℓ2) ← (vertex, vertex)
 4  while n > 2 do
 5      vertex ← NEXTFREEVERTEX(vertex)
 6      if TYPE(vertex) = bud then
 7          PUTINSTACK(vertex)
 8      elseif ISSTACKEMPTY() then
 9          (ℓ1, ℓ2) ← (ℓ2, vertex)
10      else bud ← POPFROMSTACK()
11          FUSEINTOEDGE(bud, vertex)
12          n ← n − 1
13  if ℓ1 = ℓ2 then
14      ℓ2 ← NEXTFREEVERTEX(vertex)
15  return (T, ℓ1, ℓ2)


REMARK 9.3.6. Lines 13 and 14 only treat the special case of a balanced blos- 
soming tree in which the second free leaf is the last one of the border word. 
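At the level of border words, the closure is a cyclic parenthesis matching in which buds open and leaves close. The Python sketch below is a simplification and not the algorithm CLOSURE itself: it works on the border word (with ℓ written "l") instead of the tree, uses a hypothetical function name, and only reports which bud is fused with which leaf and which two leaves stay free. Two turns around the word suffice, since buds left on the stack after the first turn are matched with the first free leaves met during the second.

```python
def closure_matching(border):
    """Return (pairs, free): matched (bud, leaf) positions and free leaves."""
    m = len(border)
    used = [False] * m
    stack, pairs = [], []
    for t in range(2 * m):             # at most two turns around the tree
        pos = t % m
        if used[pos]:
            continue
        if border[pos] == "b":
            stack.append(pos)          # PUTINSTACK
            used[pos] = True
        elif stack:                    # a leaf following an unmatched bud
            pairs.append((stack.pop(), pos))
            used[pos] = True
        # a leaf met with an empty stack stays free for the moment
    free = [i for i in range(m) if not used[i]]
    return pairs, free
```

On the balanced border word `blbllbblbllblbblll` of Figure 9.18(b), the free leaves are at positions 4 and 17; the latter is the last letter, that is the root, as predicted by Lemma 9.3.5.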


The complete closure of a balanced blossoming tree is obtained from its 
closure by fusing the two remaining vertices of degree 1 and the incident dangling 
edges into a root edge. Lemma 9.3.5 implies that the complete closure of a 
blossoming tree with n nodes is a 4-valent map with n vertices. The following 
more precise theorem will be proved in the next section. 


THEOREM 9.3.7. The complete closure is one-to-one between balanced blossoming
trees with n nodes and 4-valent maps with n vertices. In particular the
number of these maps is $\frac{2}{n+2} \cdot \frac{3^n}{n+1}\binom{2n}{n}$.

As a corollary we already have the complete description of a random sam- 
pling algorithm for 4-valent maps with n vertices. Apart from the function 
CLOSURE(), it uses the random generator FLORENTINEREJECTION() defined in 
Section 9.2 and the following items: 

— a function PREFIXDECODE(w) that constructs the binary tree encoded by
a Dyck-Łukasiewicz word w,

— a function ADDBUD(n, i) that adds a bud to a node n in one of the three
possible manners,

— a function ADDROOT(M, ℓ1, ℓ2) that roots the map M by fusing its two
leaves ℓ1 and ℓ2 into an oriented edge.

RANDMAP(n)
1  w ← FLORENTINEREJECTION(n, 0)
2  T ← PREFIXDECODE(wd)
3  for node ∈ T do
4      ADDBUD(node, RAND(1, 3))
5  (M, ℓ1, ℓ2) ← CLOSURE(T)
6  if RAND(1, 2) = 1 then
7      ADDROOT(M, ℓ1, ℓ2)
8  else ADDROOT(M, ℓ2, ℓ1)
9  return M


COROLLARY 9.3.8. RANDMAP(n) outputs a uniform random 4-valent map 
with n vertices in linear time. 


9.3.4. The opening of a 4-valent map 


The dual of a planar map M is the planar map M* defined as follows: in each
face of M put a vertex, and join these new vertices by edges dual to the edges
of M. By construction the vertices, edges and faces of M* are respectively in
bijection with faces, edges and vertices of M. This construction is illustrated by
Figure 9.20(a). The proof of the following property of duality in planar maps is
left to the reader.


LEMMA 9.3.9. Let (E1, E2) be a partition of the set of edges of a planar map
M. Then E1 is a spanning tree of M if and only if E2* is a spanning tree of M*.
In this case we call (E1, E2) a spanning tree decomposition of M.
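Lemma 9.3.9 can be tested exhaustively on small maps. The Python sketch below assumes a dart representation chosen for the illustration: `opposite` pairs the two darts of each edge, `next_ccw` gives the next dart around the origin, and faces are taken to be the cycles of dart → `next_ccw[opposite[dart]]`. All function names are hypothetical. For every subset E1 of edges, the check compares "E1 is a spanning tree of M" with "the duals of the remaining edges form a spanning tree of M*".

```python
from itertools import combinations

def label_cycles(perm):
    """Map each dart to the index of its cycle; also return the cycle count."""
    labels, count = {}, 0
    for start in perm:
        if start not in labels:
            d = start
            while d not in labels:
                labels[d] = count
                d = perm[d]
            count += 1
    return labels, count

def is_spanning_tree(n_vertices, edges):
    """Union-find test: n_vertices - 1 edges and no cycle."""
    if len(edges) != n_vertices - 1:
        return False
    parent = list(range(n_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
    return True

def check_duality(opposite, next_ccw):
    """True if the equivalence of the lemma holds for every edge subset."""
    vertex_of, nv = label_cycles(next_ccw)
    face_of, nf = label_cycles({d: next_ccw[opposite[d]] for d in opposite})
    edges = sorted({min(d, opposite[d]) for d in opposite})  # one dart per edge
    for size in range(len(edges) + 1):
        for e1 in combinations(edges, size):
            e2 = [d for d in edges if d not in e1]
            primal = [(vertex_of[d], vertex_of[opposite[d]]) for d in e1]
            dual = [(face_of[d], face_of[opposite[d]]) for d in e2]
            if is_spanning_tree(nv, primal) != is_spanning_tree(nf, dual):
                return False
    return True
```

The check succeeds on planar examples such as a triangle or a planar figure-eight, and fails when the same two loops are given a rotation that embeds them on a torus, which shows that planarity is essential in the lemma.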


From now on, let M be a planar map, and (E1, E2) be a spanning tree
decomposition of M. For e an edge of E2, opening e with respect to (E1, E2)
will mean: orienting e so that the cycle it induces with the tree E1 is counterclockwise,
and then replacing e by two dangling edges, the one attached to the
origin of e holding a bud b(e), the other one holding a leaf ℓ(e). We shall always
assume moreover that the root r of M belongs to E2. Then, the opening of M
with respect to (E1, E2) is the tree T defined as follows (see Figure 9.20(c)):

— open each edge e ∈ E2 with respect to (E1, E2),

— replace the bud b(r) by a leaf and plant the tree on it.

The tree T thus consists of the edges of the spanning tree E1 together with
pairs of dangling edges associated to edges of E2. More precisely, each of these edges
contributes one bud and one leaf, except for the root, which contributes two
leaves. By construction, the opening T of a 4-valent planar map M with n
vertices has n nodes of degree 4, n buds and n + 2 leaves.


LEMMA 9.3.10. The complete closure of the opening of a planar map M with 
respect to any spanning tree decomposition is the planar map M itself. 


Proof. The opening of an edge merges the two faces incident to it. Since E2*
forms a spanning tree of M*, the openings can be performed sequentially so that
one of the two merged faces is always the infinite face. It is then immediate
at each step that the pair of bud and leaf created by the opening of an edge
corresponds to a matched pair in the closure. □

Figure 9.20. An opening of the map of Figure 9.19(c): (a) a map and its dual, (b) a spanning tree decomposition, (c) and the corresponding opening.


There are in general many spanning tree decompositions of M, and the right
one must be chosen to invert the closure. To explain how this is done we need
to introduce the distance in the dual map M*: two faces of M are adjacent if
they share a common edge, and the distance between two faces f and f' is the
length k of the shortest path (f0, ..., fk) where f0 = f, fk = f' and for all i, the
two faces $f_i$ and $f_{i-1}$ are adjacent. Observe that the dual M* of a 4-valent map
has only faces with even degrees (in fact degree 4), so that it does not contain
any cycle of odd length, and the distances of a face f to two adjacent faces f'
and f'' always differ by 1.

To each face f of M, associate the face r(f) incident to the root edge r and
closest to f for the distance in M*. The set P(f) of paths of minimal length
from r(f) to f forms a bundle of paths bounded by two paths P0(f) and P1(f),
with P0(f) having the bundle on its right hand side. We shall call P0(f) the
leftmost minimal path from the root to f. The union of r* and of the edges of
the paths P0(f) for all faces f of M forms a spanning tree of M*: the existence
of a cycle would prevent one of the paths from being leftmost. This tree is called
the leftmost breadth first search tree of M* starting from r*, because it is also
given by a breadth first search traversal with the left hand rule. As stated in
the following proposition, it is the spanning tree we are looking for.


PROPOSITION 9.3.11. Let M be a 4-valent map with root edge r and (E1, E2)
be a spanning tree decomposition such that r ∈ E2. Then the opening of M
with respect to (E1, E2) is a blossoming tree if and only if E2* is the leftmost
breadth first search tree of M* starting from r*.


The proof of this proposition is based on two lemmas. The first one is a
characterization of blossoming trees.

LEMMA 9.3.12. A tree T with n buds, n + 2 leaves and n nodes of degree 4 is
a blossoming tree if and only if, for every inner edge e, the two components of
T \ e both contain one more leaf than buds.


Proof. The characterization is trivial for n = 1, and remains true when a further
node with two leaves and a bud is attached in place of a leaf. The lemma thus
follows by induction since every tree can be obtained by adding new nodes
incrementally. □


For the second lemma it is useful to view the spanning tree E2* as rooted on
r*, with the convention that the infinite face of M is the origin of the root.


LEMMA 9.3.13. Let e be an edge of E1 separating two faces f, f', with f
before f' in the leftmost depth first order on the tree E2*. Consider the paths
P and P' from f and f' to their common ancestor in E2*, which define with e*
a cycle separating a bounded region B of the plane from an unbounded one U.
Then,
— the opening of an edge of P with respect to (E1, E2) creates a leaf in B
and a bud in U,
— and the opening of an edge of P' with respect to (E1, E2) creates a bud
in B and a leaf in U.


Proof. The result is immediate upon comparing the orientation used in the
definition of the opening of an edge and the orientation of the cycle going from
e* up the path P and down the path P'. □


Proof of Proposition 9.3.11. First assume that E_2^* is the leftmost breadth first
search tree of M* starting from r*, and let T be the opening of M with respect
to E_2^*. According to Lemma 9.3.12, it suffices to check that for any edge e of E_1,
both components of T \ e contain one more leaf than buds. Let us consider
the paths P and P' of Lemma 9.3.13. The breadth first search condition on E_2^*
implies that the lengths of these two paths differ by at most 1, hence by exactly
1, in view of the discussion of distances in M*. The leftmost condition on E_2^*
moreover implies that the shorter of the two paths must be P'. Finally observe
that the two components of T \ e are separated by the dual cycle of Lemma 9.3.13,
so that this lemma can be used to count buds and leaves in the two regions.
This can be done easily upon distinguishing whether r* is on P or not.

Now let E_2^* be a spanning tree of M* different from the leftmost breadth
first search tree. Then there are leftmost minimal paths that do not appear
in E_2^*. Among the shortest of them let P_0(f) be the leftmost one, connecting the
root to a face f. Since P_0(f) is minimal, all its edges but the last one e belong
to E_2^*. Moreover, by definition of P_0(f), this path is to the left of and no longer
than the path P(f) connecting the root to f in E_2^*. Applying Lemma 9.3.13
to e, P ⊂ P_0(f) and P' ⊂ P(f) and comparing the lengths of these two paths
shows that P' is longer than P, so that the two components of T \ e do not have
the expected number of buds and leaves. □


Figure 9.21. Inverse of the edge-map construction. (a) Bicoloration of faces. (b) Corresponding map.


The opening of M with respect to (E_1, E_2) with E_2^* the leftmost breadth
first search tree of M* at r* will be called simply the opening of M. In view
of Lemma 9.3.10, Proposition 9.3.11 completes the proof of Theorem 9.3.7: the
opening is the inverse of the closure. Moreover it induces a linear time algorithm
OPENING(M) that recovers the unique balanced blossoming tree T such that
CLOSURE(T) = M:


OPENING(M)
1 Perform a leftmost bfs traversal of the dual map M* starting from r*.
2 Open the edges of the resulting tree to create buds and leaves.
3 Return the resulting balanced blossoming tree.


9.3.5. A code for planar maps 


Theorem 9.3.7 deals with a specific family of planar maps, namely 4-valent ones.
It turns out however that 4-valent maps play for planar maps the role that
edge-graphs play for graphs. More precisely, define the edge-map of a planar map M
as the 4-valent map having as vertex set the set of edges of M and having
an edge for each corner c of the map M.


PROPOSITION 9.3.14. The edge-map construction is a bijection between planar
maps with n edges and 4-valent maps with n vertices. In particular the
number of planar maps with n edges is


    \frac{2 \cdot 3^n}{(n+1)(n+2)} \binom{2n}{n}.


Proof. The inverse construction follows from the remark that the faces of a
4-valent map F can be colored in two colors, black and white, so that adjacent
faces have different colors. The planar map M is obtained by putting a vertex
into each black face of F and joining these vertices by an edge across each vertex
of F. □


The edge-map construction thus allows us to deduce from Theorem 9.3.7 a
code for the family of planar maps.


ENCODEMAP(M)
1 F ← EDGEMAP(M)
2 T ← OPENING(F)
3 for node ∈ T do
4     w'[node] ← POSITIONOFBUD(node)
5     T ← REMOVEBUD(T)
6 w ← BINARYCODE(T)
7 return (w, w')


THEOREM 9.3.15. The algorithm ENCODEMAP() encodes a planar map with
n edges by a pair of words respectively in {a, b}^{2n} and {0, 1, 2}^n. In view of the
number of planar maps, this code is optimal.
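
The optimality claim can be checked numerically. The sketch below is only an illustration, not part of the original text: it assumes the count of Proposition 9.3.14 in the form 2·3^n/((n+1)(n+2)) · C(2n, n), and the function names are hypothetical. A pair of words in {a, b}^{2n} × {0, 1, 2}^n carries 2n + n·log2(3) bits, and the ratio between the information-theoretic minimum log2(number of maps) and this length should approach 1 as n grows.

```python
from math import comb, log2

def planar_map_count(n: int) -> int:
    """Number of planar maps with n edges, as given in Proposition 9.3.14."""
    numerator = 2 * 3**n * comb(2 * n, n)
    denominator = (n + 1) * (n + 2)
    return numerator // denominator  # the division is exact

def code_length_bits(n: int) -> float:
    """Bits carried by a pair of words in {a,b}^(2n) x {0,1,2}^n."""
    return 2 * n + n * log2(3)

# Ratio of the minimal number of bits to the length of the code.
for n in (1, 10, 100, 1000):
    ratio = log2(planar_map_count(n)) / code_length_bits(n)
    print(n, round(ratio, 4))
```

The first values of the count are 2, 9, 54, 378 for n = 1, 2, 3, 4.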


Problems


Section 9.1


9.1.1 Show that the generating function of a rational language with respect
to the length is rational.
9.1.2 Compute the generating function with respect to the length of walks
that never immediately undo a step they have just done.
9.1.3 Define the area under a Dyck word as the number of integer points
between the horizontal axis and the associated walk. Use Catalan's
factorization to show that the sum of the area under all Dyck words of
length 2n is 4^n.
(Chottin and Cori 1982)
9.1.4 Show that an algebraic language that can be generated by a non-ambiguous
context free grammar has an algebraic generating function with
respect to the length.
9.1.5 Give a bijective proof of the fact that the number of bicolored Motzkin
words of length n is equal to the number of Dyck words of length 2n + 2.
Give a bijective proof of the right hand side formula in Proposition 9.1.9
for the number of loops of length 2n that stay in the quadrant (x ≥
0, y ≥ 0).
(Guy et al. 1992)


Section 9.2


9.2.1 What is the number of staircase and unimodal polygons with
semi-perimeter n?
*9.2.2 Show bijectively that the number of convex polyominoes with bounding
box (p, q) is

    \binom{2p+2q}{2p} + q \binom{2p+2q-1}{2p-1} - 2(p+q) \binom{p+q-1}{q} \binom{p+q-1}{p}.

What is the number of convex polyominoes with semi-perimeter n?
(Bousquet-Mélou and Guttmann 1997, Gessel 2000)
9.2.3 An animal on the square lattice has compact source if there exists k
such that every vertex of the animal can be reached from one of the
vertices (i, k − i) with 0 ≤ i ≤ k by a path going north or east inside
the animal. In particular directed animals are exactly the animals with
compact source for k = 0.
Prove that there are 3^{n−1} animals of size n with compact source.
(Gouyou-Beauchamps and Viennot 1988)
*9.2.4 Give a bijection between bilateral Dyck paths of length n and (not
necessarily strict) pyramids of n bricks such that the number of pairs of
steps connecting levels i and i + 1 is mapped onto the number of bricks
in position (i, i + 1).
(Viennot 1986)
**9.2.5 Give a uniform random sampling algorithm of expected linear complexity
for the set of words of length n on an arbitrary fixed finite alphabet
that have the Łukasiewicz property.


Section 9.3 


9.3.1 Give a direct bijection between plane trees with n edges and binary 
trees with n nodes. 
*9.3.2 What is the number of rooted planar maps with d_i vertices of degree 2i
for all i ≥ 1 and no odd degree vertex?
(Schaeffer 1997) 
**9.3.3 Compute the generating function of rooted planar maps according to 
the distribution of degrees. 
(Bouttier et al. 2002) 
**9.3.4 Show that planted plane trees with two leaves per inner vertex are
in one-to-one correspondence with rooted triangulations with a marked
face.
(Poulalhon and Schaeffer 2003) 


Notes 


Although this chapter can be read independently, it is intended as a companion 
to Chapter 11, Words and trees, in Lothaire 1997. Systematic approaches to 
enumeration, in particular using generating functions, are described in the books 
Goulden and Jackson 1983, Bergeron et al. 1998 and in the more recent Stanley
1999, Flajolet and Sedgewick 2002. In particular the relevance of rational,
algebraic and D-finite series to enumeration is emphasized in the last two.

The enumeration of walks in the plane, in the half plane and in the quar- 
ter plane has become part of the combinatorial folklore, as well as Dyck walks 
and Catalan's factorization. The cycle lemma is attributed in the combinatorial
literature to Dvoretzky and Motzkin 1947, where it is used to derive Proposition
9.1.4. As first shown by Raney 1960 (see also Chapter 11 of Lothaire 1997),
the cycle lemma is a combinatorial version of the Lagrange inversion formula, 
which has numerous applications in enumerative combinatorics. More detailed 
historical accounts can be found in Pitman 1998 and Stanley 1999. 

The classification of the possible asymptotic behaviors of the Taylor coef- 
ficients of an algebraic series can be found in Flajolet 1987. The generating 
function of walks on the slitplane according to the length and the coordinates of 
the extremities was first shown algebraic and computed in Bousquet-Mélou and 
Schaeffer 2002. This is one in a series of results obtained recently by writing 
and solving linear equations with catalytic variables, see Banderier and Flajolet 
2002, Bousquet-Mélou 2002 (these references are also good entry points to the 
literature on counting walks on lattices). The first proof we present illustrates 
a very general approach developed in Bousquet-Mélou 2001. The second proof 
is taken from Barcucci et al. 2001. 

The foundation of combinatorial random generation was laid in Nijenhuis 
and Wilf 1978 with the recursive method. As shown in Flajolet et al. 1994, 
this approach leads systematically to polynomial algorithms for decomposable 
combinatorial structures. The (much more specialized) application of the cy- 
cle lemma to random generation is discussed in Dershowitz and Zaks 1990 and 
Alonso et al. 1997. The Florentine rejection algorithm is taken from Barcucci 
et al. 1995. A systematic utilisation of mixed probabilistic/combinatorial argu- 
ments for sampling was recently proposed in Duchon et al. 2002. 

General references on polyominoes are Klarner 1997, van Rensburg 2000.
Exact enumerative results are surveyed in Bousquet-Mélou 1996. The algorithms
to sample convex and directed convex polyominoes are adapted from
Hochstättler et al. 1996 and Del Lungo et al. 2001. From the enumeration point
of view, these results are encompassed by Bousquet-Mélou and Guttmann 1997,
which deals with convex polygons in any dimension. Our treatment of directed
animals and heaps of bricks is adapted from Bétréma and Penaud 1993. These
results build on the combinatorial interpretation of the commutation monoid of
Cartier and Foata 1969 in terms of heaps of pieces due to Viennot 1986.

Starting from the seminal work of Tutte 1962, the literature on combinatorial
maps has grown almost independently in combinatorics and in physics. Some
surveys are Cori and Machì 1992 (combinatorial point of view), Ambjørn et al.
1997 (physical point of view) and Di Francesco 2001 (mixed points of view).

A more detailed description of codes for plane trees appears in Chapter 11
of Lothaire 1997. The idea to use algebraic languages to encode maps already
appeared in Cori 1975, and plane trees are explicitly used in Cori and Vauquelin
1981. Conjugacy classes of trees were introduced in Schaeffer 1997, as well as
the bijection between balanced trees and planar maps. Applications to coding
and sampling are discussed in Poulalhon and Schaeffer 2003.


10.0. Introduction


This chapter shows some examples of applications of combinatorics on words
to number theory with a brief incursion into physics. These examples have a
common feature: the notion of morphism of the free monoid. Such morphisms
have been widely studied in combinatorics on words; they generate infinite words
which can be considered as highly ordered, and which occur ubiquitously
in mathematics, theoretical computer science, and theoretical physics.

The first part of this chapter is devoted to the notion of automatic sequences 
and uniform morphisms, in connection with transcendence of formal power series 
with coefficients in a finite field. Namely it is possible to characterize algebraicity 
of these series in a simple way: a formal power series is algebraic if and only if 
the sequence of its coefficients is automatic, i.e., if it is the image by a letter- 
to-letter map of a fixed point of a uniform morphism. This criterion is known 
as Christol’s theorem. A central tool in the study of automatic sequences is 
the notion of kernel of an infinite word (sequence) over a finite alphabet: this 
is the set of subsequences obtained by certain decimations. A rephrasing of 
Christol’s theorem is that transcendence of a formal power series over a finite 
field is equivalent to infiniteness of the kernel of the sequence of its coefficients: 
this will be illustrated in this chapter. 

Examples of applications of the properties of automatic sequences to tran- 
scendence results for power series over the rationals, and for real numbers whose 
base b-expansion is automatic are also given. 

Then, in a second part, this chapter uses a famous infinite word, the Tribonacci
word, as a guideline to introduce various applications in Diophantine
approximation and in simultaneous approximation. The Tribonacci word was
introduced as a generalization of the celebrated Fibonacci word. It is defined 
as the fixed point of a non-uniform primitive morphism, called the Tribonacci 
morphism. We first associate in a natural way a numeration system with this 
morphism, that leads us to the definition of a compact subset of the plane with 
fractal boundary, called the Rauzy fractal. By closely studying its topological 
properties, we show that this compact set can be considered as a fundamental 
domain for a lattice of the plane, and that a particular geometric transforma- 
tion, namely an exchange of pieces, can be performed on it. This transformation 
can furthermore be factored as a translation on the two-dimensional torus. The 
goal of this chapter is then to show how to deduce arithmetic properties of this 
translation from combinatorial properties of the Tribonacci word. In particular, 
it is shown how to associate with some prefixes of this infinite word best approx- 
imations for a given norm of the corresponding vector of translation. Relations 
to tilings and quasicrystals via the cut and project method are also mentioned.


10.1. Morphic and automatic sequences: definitions and generalities


In this section we define morphisms, uniform morphisms, morphic sequences 
and automatic sequences. 


10.1.1. Topology and distance on the set of finite and infinite words 


Let A be a finite alphabet. The set A is equipped with the discrete topology (i.e.,
every subset is open), and the set A^ω of infinite words (that we also call here
infinite sequences) on A is equipped with the corresponding product topology.
It is well-known and not hard to prove that the product topology can also be
defined by the following distance:


d((u_n)_{n≥0}, (v_n)_{n≥0}) := 2^{−min{j : u_j ≠ v_j}}   (and d = 0 if the two words are equal).


The topology on A^ω can be extended to the set A* ∪ A^ω of all finite and infinite
words on A as follows: let # be a symbol not in A. The set A ∪ {#} is equipped
with the discrete topology and the set (A ∪ {#})^ω is equipped with the product
topology. Finally the set A* is naturally embedded in (A ∪ {#})^ω by identifying
the word u_0 u_1 ⋯ u_d in A* and the infinite word u_0 u_1 ⋯ u_d (#)^ω in (A ∪ {#})^ω
(where (#)^ω stands for the infinite word whose terms are all equal to #).


REMARK 10.1.1. Note that the distance defined above can be informally described
by saying that two words are close to each other if they coincide on their
first letters. Also note that the set A^ω is a compact set.


10.1.2. Morphisms and uniform morphisms 


Let A and B be two alphabets. Let us recall that a morphism h : A* → B* is a
map from A* to B* such that for all u, v ∈ A*, the relation h(uv) = h(u)h(v)
holds. (In other words h is a homomorphism of monoids.)


REMARK 10.1.2.

• A morphism h : A* → B* is defined by its values on the elements of A.

• The iterates of a morphism h : A* → A* are denoted h^j, j ≥ 0, and
defined by h^0(a) = a for all a ∈ A and h^{j+1} := h ∘ h^j.


The morphism h : A* → B* is called uniform if all the words h(a), a ∈ A,
have the same length. If d is this common length, the morphism is called a
morphism of length d or a d-uniform morphism or a d-morphism.


10.1.3. Fixed points of morphisms, morphic sequences and automatic sequences


PROPOSITION 10.1.3. Let A be an alphabet. Let h : A* → A* be a morphism
such that there exist a ∈ A and x ∈ A* with the properties:

(i) h(a) = ax,
(ii) ∀j ≥ 0, h^j(x) ≠ ε.

Then, the sequence of words a, h(a), h^2(a), ..., h^n(a), ... converges to an
infinite word denoted h^ω(a). This infinite word is a fixed point of the extension
of h by continuity to infinite words.


Proof. The hypotheses easily imply that h^{j+1}(a) = a x h(x) h^2(x) ⋯ h^j(x), for
j ≥ 0. Hence the word h^j(a) is a proper prefix of the word h^{j+1}(a), which
gives the convergence of the sequence of words h^j(a) to an infinite word h^ω(a).
Since h^{j+1}(a) = h(h^j(a)), letting j go to infinity establishes the claim. □


REMARK 10.1.4. In the sequel we will say that an infinite word u on the alphabet
A is a fixed point of a morphism h : A* → A* if and only if it can be
obtained as in Proposition 10.1.3 above. One thus has h(u) = u.
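
As an illustration of Proposition 10.1.3, the following sketch computes a prefix of such a fixed point by iterating the morphism. It is a minimal example under stated assumptions: the morphism is given as a dictionary from letters to words, and the name fixed_point_prefix is hypothetical, not taken from the text. The two checks correspond to conditions (i) and (ii).

```python
def fixed_point_prefix(h, a, length):
    """Return the first `length` letters of the limit of a, h(a), h^2(a), ...

    h is a dict mapping each letter to its image word; a is the start letter.
    """
    # Condition (i): h(a) = a x with x nonempty.
    if len(h[a]) < 2 or h[a][0] != a:
        raise ValueError("need h(a) = a x with x nonempty")
    word = a
    while len(word) < length:
        image = "".join(h[letter] for letter in word)
        # Condition (ii): if some h^j(x) is empty the iterates stop growing.
        if len(image) <= len(word):
            raise ValueError("the iterates h^j(a) stop growing")
        word = image  # h^j(a) is a prefix of h^(j+1)(a)
    return word[:length]

print(fixed_point_prefix({"0": "01", "1": "10"}, "0", 16))
```

Since each iterate is a prefix of the next one, truncating any sufficiently long iterate gives the same prefix of the infinite word.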


The fixed points (in the sense of Remark 10.1.4) of a uniform morphism have 
a simple property that we give now. 


PROPOSITION 10.1.5. An infinite word (u_n)_{n≥0} on the alphabet A is a fixed
point of the d-morphism h : A* → A* (in the sense of Remark 10.1.4) if and
only if there exist d maps h_r : A → A, r ∈ [0, d − 1], such that


∀n ≥ 0, ∀r ∈ [0, d − 1], u_{dn+r} = h_r(u_n).


Proof. Suppose that the infinite word (u_n)_{n≥0} is a fixed point of the d-morphism
h : A* → A*, i.e., the limit when j goes to infinity of the sequence of words a,
h(a), h^2(a), ..., h^k(a), ..., with a ∈ A and h(a) = ax, with x ∈ A* and h^j(x) ≠ ε
for all j ≥ 0. Since h is d-uniform, for each letter e ∈ A, the word h(e) can
be written as h(e) = a_{e,0} a_{e,1} ⋯ a_{e,d−1}. We define the maps h_r : A → A,
r ∈ [0, d − 1], by: for each e ∈ A, h_r(e) := a_{e,r}. Now, for each k ≥ 0, the length
of the word h^k(a) is equal to d^k, hence


h^k(a) = u_0 u_1 ⋯ u_{d^k − 1}.

We thus have

u_0 u_1 ⋯ u_{d^{k+1} − 1} = h^{k+1}(a) = h(h^k(a))
    = h(u_0 u_1 ⋯ u_{d^k − 1})
    = h(u_0) h(u_1) ⋯ h(u_{d^k − 1}).

This gives:

∀n ∈ [0, d^k − 1], ∀r ∈ [0, d − 1], u_{dn+r} = h_r(u_n).

Since this holds for all k ≥ 0, we thus have


∀n ≥ 0, ∀r ∈ [0, d − 1], u_{dn+r} = h_r(u_n).


Conversely suppose that there exist d maps h_r : A → A, r ∈ [0, d − 1], such
that

∀n ≥ 0, ∀r ∈ [0, d − 1], u_{dn+r} = h_r(u_n).


Taking n = r = 0, we get u_0 = h_0(u_0). Define the morphism h : A* → A* by

∀e ∈ A, h(e) := h_0(e) h_1(e) ⋯ h_{d−1}(e).

Furthermore let a := u_0. The morphism h is clearly uniform. We have

h(a) = h(u_0) = h_0(u_0) h_1(u_0) ⋯ h_{d−1}(u_0) = u_0 h_1(u_0) ⋯ h_{d−1}(u_0) = ax,


where x := h_1(u_0) ⋯ h_{d−1}(u_0). For all j ≥ 0 we clearly have |h^j(x)| = d^j(d − 1),
hence h^j(x) ≠ ε. Thus Conditions (i) and (ii) of Proposition 10.1.3 are satisfied.
It is then easy to check that h^ω(a) is precisely the word (u_n)_{n≥0}. □
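
The two directions of Proposition 10.1.5 can be illustrated on a small example. The sketch below is an illustration only: it uses the 2-uniform morphism a → ab, b → ac, c → db, d → dc as test data, and the helper names letter_maps and satisfies_relations are hypothetical. It extracts the maps h_r from h, checks the relations u_{dn+r} = h_r(u_n) on a finite prefix, and rebuilds h from the maps h_r.

```python
def letter_maps(h, d):
    """The d maps h_r: letter e -> r-th letter of h(e), for a d-uniform h."""
    return [{e: h[e][r] for e in h} for r in range(d)]

def satisfies_relations(u, maps):
    """Check u[d*n + r] == h_r(u[n]) wherever both indices lie inside u."""
    d = len(maps)
    return all(u[d * n + r] == maps[r][u[n]]
               for n in range(len(u) // d) for r in range(d))

h = {"a": "ab", "b": "ac", "c": "db", "d": "dc"}
u = "a"
for _ in range(6):                      # u = h^6(a), a prefix of length 64
    u = "".join(h[e] for e in u)

maps = letter_maps(h, 2)
holds = satisfies_relations(u, maps)    # direct implication, on the prefix

# Converse: the morphism is recovered as h(e) = h_0(e) h_1(e) ... h_{d-1}(e).
rebuilt = {e: "".join(m[e] for m in maps) for e in h}
print(holds, rebuilt == h)
```

A finite check of this kind cannot replace the proof; it only confirms the relations on the computed prefix.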


A word (u_n)_{n≥0} on the alphabet A is called a morphic sequence (or substitutive
sequence) if there exist an alphabet C, a word (v_n)_{n≥0} on C, a morphism
h : C* → C*, and a map φ : C → A such that

(i) the word (v_n)_{n≥0} is a fixed point of the morphism h (see Remark 10.1.4),

(ii) for all n ≥ 0, one has u_n = φ(v_n).


A word (u_n)_{n≥0} on the alphabet A is called an automatic sequence if there
exist an alphabet C, a word (v_n)_{n≥0} on C, a uniform morphism h : C* → C*,
and a map φ : C → A such that

(i) the word (v_n)_{n≥0} is a fixed point of the uniform morphism h (see Remark
10.1.4),

(ii) for all n ≥ 0, one has u_n = φ(v_n).


If the morphism h has length d, the word (u_n)_{n≥0} is called d-automatic.


REMARK 10.1.6. An automatic sequence is in particular morphic. The de- 
nomination “automatic” comes from the fact that such an infinite word can be 
generated by a finite automaton. 
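
One way to make the automaton of Remark 10.1.6 concrete is sketched below, under stated assumptions: the states are the letters of C, the initial state is the letter a with h(a) = ax, and reading a digit r moves from state e to the r-th letter of h(e), which is the relation v_{dn+r} = h_r(v_n) of Proposition 10.1.5. The function name automatic_term and the dictionary encoding of h and of the output map are hypothetical.

```python
def automatic_term(h, a, coding, n):
    """n-th term of the sequence coding(v_n), where v is the fixed point
    of the uniform morphism h starting with the letter a.

    The base-d digits of n are read from the most significant one.
    """
    d = len(h[a])                 # common length of the images
    digits = []
    while n > 0:
        n, r = divmod(n, d)
        digits.append(r)          # least significant digit first
    state = a
    for r in reversed(digits):    # most significant digit first
        state = h[state][r]       # v_{d*m + r} = h_r(v_m)
    return coding[state]

thue_morse = {"0": "01", "1": "10"}
identity = {"0": 0, "1": 1}
print([automatic_term(thue_morse, "0", identity, n) for n in range(16)])
```

Computing the n-th term this way takes a number of steps proportional to the number of digits of n, without building the first n letters of the word.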

10.1.4. Examples of morphic and automatic sequences 

10.1.4.1. The Fibonacci word 


The (binary) Fibonacci word is defined as the fixed point (in the sense of Remark
10.1.4) of the morphism 0 → 01, 1 → 0, on the alphabet {0, 1}. The first
few terms of this word are


0100101001001010···


The name of this word comes from the fact that iterating the morphism starting
from 0 gives words whose lengths are equal to the Fibonacci numbers
1, 2, 3, 5, 8, ...:

0
01
010
01001
01001010


It can be shown that this word is a Sturmian word, i.e., that the number of
blocks of consecutive letters of length n occurring in the word is equal to n + 1
for each n ≥ 1 (see Problem 10.7.1).
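
Both statements can be observed experimentally. The sketch below is an illustration, not a proof: it iterates the morphism 0 → 01, 1 → 0, records the lengths of the iterates, and counts the distinct blocks of each length in a finite prefix. The names fibonacci_iterates and block_count are hypothetical. Counting on a prefix only sees the blocks that occur in that prefix, so the prefix is taken much longer than the blocks.

```python
def fibonacci_iterates(count):
    """The words h^0(0), h^1(0), ..., h^count(0) for h: 0 -> 01, 1 -> 0."""
    h = {"0": "01", "1": "0"}
    words = ["0"]
    for _ in range(count):
        words.append("".join(h[c] for c in words[-1]))
    return words

def block_count(word, n):
    """Number of distinct blocks of n consecutive letters occurring in word."""
    return len({word[i:i + n] for i in range(len(word) - n + 1)})

words = fibonacci_iterates(15)
lengths = [len(w) for w in words[:6]]
prefix = words[-1]                       # a prefix of the Fibonacci word
complexity = [block_count(prefix, n) for n in range(1, 11)]
print(lengths)
print(complexity)
```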


10.1.4.2. The Tribonacci word


The Tribonacci word is defined as the fixed point (in the sense of Remark 10.1.4)
of the morphism 1 → 12, 2 → 13, 3 → 1 on the alphabet {1, 2, 3}. The first few
terms of this word are


1213121121312121···


Both words share many properties and the Tribonacci word can be considered
as a generalization of the Fibonacci word, hence the terminology. We study
the Tribonacci word in more detail in Sections 10.7–10.9.
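
The analogy with the Fibonacci word shows up already on the iterates of the morphism: writing T_n for the n-th iterate of 1, one has T_{n+3} = T_{n+2} T_{n+1} T_n, so that the lengths satisfy the Tribonacci recurrence. This identity is not stated in the text above; it follows from the definition by expanding the image of 12 twice. The short sketch below, with hypothetical variable names, checks it on the first iterates.

```python
h = {"1": "12", "2": "13", "3": "1"}

# iterates[n] is the n-th iterate of the morphism applied to the letter 1.
iterates = ["1"]
for _ in range(10):
    iterates.append("".join(h[c] for c in iterates[-1]))

lengths = [len(w) for w in iterates]

# Concatenation identity T_{n+3} = T_{n+2} T_{n+1} T_n.
concatenation_holds = all(
    iterates[n + 3] == iterates[n + 2] + iterates[n + 1] + iterates[n]
    for n in range(len(iterates) - 3)
)
print(lengths[:7], concatenation_holds)
```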


10.1.4.3. The Thue—Morse word 


Let us recall (see Example 1.8.4) that the (Prouhet–)Thue–Morse word is defined
as the fixed point (in the sense of Remark 10.1.4) beginning with 0 of the
morphism 0 → 01, 1 → 10. The first few terms of this word are


011010011001011010···


The n-th term (starting from index 0) of this word is 0 if the sum of the binary
digits of n is even, and 1 if this sum is odd. This property can easily be
deduced from the results of Section 10.2.3.
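
The agreement between the two descriptions can be checked on a prefix. The sketch below is a finite verification only, with hypothetical names: it builds the tenth iterate of the morphism (1024 letters) and compares it with the parity of the binary digit sum.

```python
def thue_morse_by_morphism(iterations):
    """Iterate 0 -> 01, 1 -> 10 starting from 0."""
    h = {"0": "01", "1": "10"}
    word = "0"
    for _ in range(iterations):
        word = "".join(h[c] for c in word)
    return word

def thue_morse_by_digits(length):
    """n-th term = parity of the number of 1's in the binary expansion of n."""
    return "".join(str(bin(n).count("1") % 2) for n in range(length))

by_morphism = thue_morse_by_morphism(10)
by_digits = thue_morse_by_digits(len(by_morphism))
print(by_morphism[:18], by_morphism == by_digits)
```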


10.1.4.4. The Rudin-Shapiro word 
We consider on the alphabet {a, b, c, d} the morphism


a → ab
b → ac
c → db
d → dc


Iterating this morphism starting from a gives the following fixed point


abacabdbabacdc···


The image of this infinite word by the map a → 1, b → 1, c → −1, d → −1, is
called the Rudin–Shapiro word. This word begins as follows


+1 +1 +1 −1 +1 +1 −1 +1 +1 +1 +1 −1 −1 −1 ···


Denoting by a(n) the number of (possibly overlapping) blocks 11 in the binary
expansion of n, it can be shown that the n-th term of the Rudin–Shapiro word
is equal to (−1)^{a(n)}. Here again, this property can easily be deduced from
the results of Section 10.2.3.
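
As a finite sanity check of this formula, the sketch below compares the image of the morphic word with (−1)^{a(n)} on the first 1024 terms. The function names are hypothetical, and the count of overlapping blocks 11 is computed with a standard bit trick: bit i of n & (n >> 1) is set exactly when bits i and i + 1 of n are both set.

```python
def rudin_shapiro_by_morphism(iterations):
    """Image under a,b -> +1 and c,d -> -1 of the iterates of the morphism."""
    h = {"a": "ab", "b": "ac", "c": "db", "d": "dc"}
    sign = {"a": 1, "b": 1, "c": -1, "d": -1}
    word = "a"
    for _ in range(iterations):
        word = "".join(h[c] for c in word)
    return [sign[c] for c in word]

def blocks_11(n):
    """Number of (possibly overlapping) blocks 11 in the binary expansion of n."""
    return bin(n & (n >> 1)).count("1")

by_morphism = rudin_shapiro_by_morphism(10)
by_blocks = [(-1) ** blocks_11(n) for n in range(len(by_morphism))]
print(by_morphism[:16], by_morphism == by_blocks)
```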


10.1.4.5. The regular paperfolding word 


We consider on the alphabet {a, b, c, d} the morphism


a → ab
b → cb
c → ad
d → cd


Iterating this morphism starting from a gives the following fixed point


abcbadcbabcdadcb···


The image of this infinite word by the map a → 0, b → 0, c → 1, d → 1, is
called the (regular) paperfolding word. This word begins as follows


0010011000110110···

Denoting this word by (z_n)_{n≥0}, it is easy to show that

z_{4n} = 0, z_{4n+2} = 1, z_{2n+1} = z_n


(which gives an alternative definition of the paperfolding word).

The proof of the following property of the paperfolding word is left to
the reader. For any word w on the alphabet {0, 1}, define the word w^R as
the word obtained by reading w backwards (in other words (w_0 w_1 ⋯ w_ℓ)^R :=
w_ℓ w_{ℓ−1} ⋯ w_0). Also define the word \bar{w} as the word obtained from w by
replacing 0's by 1's and 1's by 0's (in other words \bar{w} := (1 − w_0)(1 − w_1) ⋯ (1 − w_ℓ)).
Define the map P on {0, 1}* by P(w) := w 0 \bar{w^R}. (The map P is called perturbed
symmetry.) Then the paperfolding word is equal to lim_{j→∞} P^j(0).
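
The three descriptions of the paperfolding word (morphism, recurrence, perturbed symmetry) can be compared on a prefix. The sketch below is only a finite check with hypothetical function names; the perturbed symmetry is taken to be P(w) = w 0 followed by the complement of the reversal of w, which is an assumption about the intended reading of the definition.

```python
def paperfolding_by_morphism(iterations):
    h = {"a": "ab", "b": "cb", "c": "ad", "d": "cd"}
    coding = {"a": "0", "b": "0", "c": "1", "d": "1"}
    word = "a"
    for _ in range(iterations):
        word = "".join(h[c] for c in word)
    return "".join(coding[c] for c in word)

def paperfolding_by_recurrence(length):
    """z_{4n} = 0, z_{4n+2} = 1, z_{2n+1} = z_n."""
    z = []
    for n in range(length):
        if n % 4 == 0:
            z.append(0)
        elif n % 4 == 2:
            z.append(1)
        else:
            z.append(z[(n - 1) // 2])   # n odd: n = 2m + 1 and z_n = z_m
    return "".join(map(str, z))

def perturbed_symmetry(w):
    """P(w) = w, then 0, then the complemented reversal of w."""
    flipped = "".join("1" if c == "0" else "0" for c in reversed(w))
    return w + "0" + flipped

by_morphism = paperfolding_by_morphism(10)          # 1024 letters
by_recurrence = paperfolding_by_recurrence(len(by_morphism))
p = "0"
for _ in range(9):                                  # P^9(0) has 1023 letters
    p = perturbed_symmetry(p)
print(by_morphism == by_recurrence, by_morphism.startswith(p))
```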


10.2. d-Kernels and properties of automatic sequences 


10.2.1. d-Kernels 


Let (u_n)_{n≥0} be an infinite word defined on the alphabet A. Let d ≥ 2 be an
integer. The d-kernel of the word (u_n)_{n≥0}, denoted K(d, (u_n)_n), is the set of
subsequences of the word (u_n)_{n≥0} defined by


K(d, (u_n)_n) := {(u_{d^k n + r})_{n≥0} ; k ≥ 0, r ∈ [0, d^k − 1]}.


REMARK 10.2.1. It is easy to prove that the d-kernel of an infinite sequence
(u_n)_{n≥0} is stable under the maps D_j, j ∈ [0, d − 1], defined on the set of
sequences on A by


∀(z_n)_{n≥0} ∈ A^ω, D_j((z_n)_{n≥0}) := (z_{dn+j})_{n≥0}.


Furthermore K(d, (u_n)_n) is the smallest set that contains the sequence (u_n)_{n≥0}
and is stable under the maps D_j, j ∈ [0, d − 1].


10.2.2. Combinatorial characterization of automatic sequences 


The notion of d-kernel yields a simple combinatorial characterization
of automatic sequences.


PROPOSITION 10.2.2. Let (u_n)_{n≥0} be an infinite sequence defined on the alphabet
A. Let d ≥ 2 be an integer. Then, the following properties are equivalent:


(i) the sequence (u_n)_{n≥0} is d-automatic,
(ii) the d-kernel K(d, (u_n)_n) is a finite set,


(iii) there exists a finite set of sequences F that contains the sequence (u_n)_{n≥0}
and such that, if the sequence (v_n)_{n≥0} belongs to F then, for every j ∈
[0, d − 1], the sequence D_j((v_n)_{n≥0}) := (v_{dn+j})_{n≥0} belongs to F.


Proof. (i) ⇒ (ii). We suppose that the sequence (u_n)_{n≥0} is d-automatic. Then
there exist an alphabet C, a sequence (v_n)_{n≥0} on C, a uniform morphism h :
C* → C*, and a map φ : C → A such that the sequence (v_n)_{n≥0} is a fixed point
of the uniform morphism h and for all n ≥ 0, one has u_n = φ(v_n).

In order to prove that the set K(d, (u_n)_n) is finite, it thus suffices to prove that
K(d, (v_n)_n) is finite. We know from Proposition 10.1.5 that there exist d maps
h_r : C → C, r ∈ [0, d − 1], such that


∀n ≥ 0, ∀r ∈ [0, d − 1], v_{dn+r} = h_r(v_n).


An easy induction on k shows the following: let t ∈ [0, d^k − 1]; write its base d
expansion (possibly with leading zeros) as t_{k−1} ... t_0; then


∀n ≥ 0, v_{d^k n + t} = h_{t_0}(h_{t_1}(⋯ (h_{t_{k−1}}(v_n)) ⋯)).


In other words there exists a map f_t from C into itself such that ∀n ≥ 0, we
have v_{d^k n + t} = f_t(v_n). The set C is finite, hence the set of maps from C to
itself is also finite. This implies that there are only finitely many sequences
(v_{d^k n + t})_{n≥0}, with k ≥ 0, t ∈ [0, d^k − 1].


(ii) ⇒ (iii). This is an easy consequence of Remark 10.2.1.

(iii) ⇒ (i). Let F = {(u_n^{(1)})_{n≥0}, (u_n^{(2)})_{n≥0}, ..., (u_n^{(t)})_{n≥0}} be a finite set of
sequences, with (u_n^{(1)})_{n≥0} = (u_n)_{n≥0}, such that F is stable under the maps D_j, for
j ∈ [0, d − 1]. Define the vector V(n) by


V(n) := (u_n^{(1)}, u_n^{(2)}, ..., u_n^{(t)})^T.


Let C ⊂ A^t be the (finite) set of values of V(n). The fact that the set F is
stable under the maps D_j for j ∈ [0, d − 1] implies that for each j ∈ [0, d − 1]
there exists a matrix θ_j of 0's and 1's, having exactly one 1 on each row, such
that

∀n ≥ 0, V(dn + j) = θ_j V(n).


Using Proposition 10.1.5 we see that the sequence (V(n))_{n≥0} is a fixed point of
the d-morphism h of C* defined by


∀a ∈ C, h(a) := (θ_0 a)(θ_1 a) ⋯ (θ_{d−1} a).


Now the sequence (u_n)_{n≥0} = (u_n^{(1)})_{n≥0} is the (pointwise) image of the sequence
(V(n))_{n≥0} by the restriction to C of the first projection A^t → A. □


REMARK 10.2.3. We have spoken in the proof of (iii) ⇒ (i) of Proposition 10.2.2
above of vectors and matrices, although there is no vector space (nor module): the
reader will be easily convinced that this is only a practical terminology (recall
the special form of the matrices θ_j).


10.2.3. Examples of kernels of automatic sequences 


The Thue–Morse word, the Rudin–Shapiro word, and the paperfolding word are
2-automatic (see their definitions in Section 10.1.4). Namely their 2-kernels are
finite:


– the definition of the Thue–Morse word (u_n)_{n≥0} shows that u_{2n} = u_n and
u_{2n+1} = 1 − u_n for every n ≥ 0; hence the 2-kernel of the Thue–Morse word is


K(2, (u_n)_n) = {(u_n)_{n≥0}, (1 − u_n)_{n≥0}};


one deduces that the n-th term (starting from index 0) of the Thue–Morse word
is 0 if the sum of the binary digits of n is even, and 1 if this sum is odd;

– the property of the Rudin–Shapiro word (v_n)_{n≥0} that v_n = (−1)^{a(n)}, where
a(n) counts the number of possibly overlapping blocks 11 in the binary expansion
of the integer n, shows that v_{2n} = v_n, v_{4n+1} = v_n, v_{4n+3} = −v_{2n+1}, for every
n ≥ 0; hence the 2-kernel of the Rudin–Shapiro word is


K(2, (v_n)_n) = {(v_n)_{n≥0}, (v_{2n+1})_{n≥0}, (−v_n)_{n≥0}, (−v_{2n+1})_{n≥0}};


denoting by a(n) the number of (possibly overlapping) blocks 11 in the binary
expansion of n, one deduces that the n-th term of the Rudin–Shapiro word is
equal to (−1)^{a(n)};


– since the regular paperfolding word (z_n)_{n≥0} satisfies z_{4n} = 0, z_{4n+2} = 1,
z_{2n+1} = z_n for every n ≥ 0, its 2-kernel is


K(2, (z_n)_n) = {0, 1, (z_n)_{n≥0}, (z_{2n})_{n≥0}}.
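
These kernel sizes can be observed numerically. The sketch below is an experiment rather than a proof, and all names in it are hypothetical: it enumerates the decimations (u_{d^k n + r})_n for k up to a bound and identifies two of them when their first terms agree. Both truncations can only merge or miss sequences, so the result is in general a lower bound for the size of the d-kernel; for the three words above it already reaches the sizes 2, 4 and 4.

```python
def thue_morse(n):
    return bin(n).count("1") % 2

def rudin_shapiro(n):
    # (-1) to the number of overlapping blocks 11 in the binary expansion
    return -1 if bin(n & (n >> 1)).count("1") % 2 else 1

def paperfolding(n):
    while n % 2 == 1:          # z_{2m+1} = z_m
        n = (n - 1) // 2
    return 0 if n % 4 == 0 else 1   # z_{4m} = 0, z_{4m+2} = 1

def kernel_lower_bound(term, d, max_k=5, compare_len=32):
    """Count the decimations n -> term(d^k n + r), k <= max_k, that differ
    on their first compare_len terms."""
    seen = set()
    for k in range(max_k + 1):
        for r in range(d ** k):
            seen.add(tuple(term(d ** k * n + r) for n in range(compare_len)))
    return len(seen)

sizes = [kernel_lower_bound(f, 2) for f in (thue_morse, rudin_shapiro, paperfolding)]
print(sizes)
```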


10.2.4. Properties of automatic sequences 


We give below some properties, in particular closure properties, of automatic 
sequences. 


PROPOSITION 10.2.4. Let d ≥ 2 be an integer. Let (u_n)_{n≥0} be a d-automatic
sequence on the alphabet A. Then the sequences (u_{d^n})_{n≥0} and (u_{d^n − 1})_{n≥0} are
periodic from some point on.


Proof. We prove only the second assertion, the first one is proved analogously.
Since the d-kernel of the sequence (u_n)_{n≥0} is finite (Proposition 10.2.2), the set of
subsequences {(u_{d^k n + d^k − 1})_{n≥0}, k ≥ 0} is finite. In particular there exist k ≥ 0
and j ≥ 1 such that the sequences (u_{d^k n + d^k − 1})_{n≥0} and (u_{d^{k+j} n + d^{k+j} − 1})_{n≥0}
are equal. In other words, the sequences (u_{d^k n − 1})_{n≥1} and (u_{d^{k+j} n − 1})_{n≥1} are
equal. Replacing n by d^j n, d^{2j} n, ..., shows that the sequences (u_{d^k n − 1})_{n≥1} and
(u_{d^{k+aj} n − 1})_{n≥1} are equal for all a ≥ 0. Taking n = 1 concludes the proof. □


PROPOSITION 10.2.5. 

Let d ≥ 2 be an integer. Let (u_n)_{n≥0} and (v_n)_{n≥0} be two d-automatic
sequences defined respectively on the alphabets A and B. Then the sequence
(u_n, v_n)_{n≥0} defined on the alphabet A × B is d-automatic.


Let d ≥ 2 be an integer. Let (u_n)_{n≥0} be a d-automatic sequence defined on
the alphabet A. Let B be an alphabet and f be a map f : A → B. Then the
sequence (f(u_n))_{n≥0} is d-automatic.


Proof. The proofs of both assertions are straightforward using the characterization
of automatic sequences given in Proposition 10.2.2. □


PROPOSITION 10.2.6. Let (u_n)_{n≥0} be a sequence on the alphabet A that is
ultimately periodic (i.e., periodic from some point on). Then, the sequence
(u_n)_{n≥0} is d-automatic for every d ≥ 2.


Proof. Since the sequence (u_n)_{n≥0} is ultimately periodic, there exist two integers
n_0 ≥ 0 and T ≥ 1, such that ∀n ≥ n_0, u_{n+T} = u_n. Now, for d ≥ 2, take a
sequence in the d-kernel of (u_n)_{n≥0}, say (u_{d^k n + ℓ})_{n≥0}, with k ≥ 0 and ℓ ∈
[0, d^k − 1]. We have for all n ≥ n_0


u_{d^k (n+T) + ℓ} = u_{d^k n + ℓ + d^k T} = u_{d^k n + ℓ}.

In other words all sequences (v_n)_{n≥0} in K(d, (u_n)_n) satisfy ∀n ≥ n_0, v_{n+T} = v_n.
Hence the d-kernel of (u_n)_{n≥0} is finite with at most (Card A)^{n_0 + T} elements. □


PROPOSITION 10.2.7. Let (u_n)_{n≥0} be a d-automatic sequence defined on the
alphabet A. Then

(i) for all a, b ∈ N, the sequence (u_{an+b})_{n≥0} is d-automatic;

(ii) the sequence (v_n)_{n≥0} defined by v_0 = a ∈ A and v_n = u_{n−1} for all n ≥ 1,
is d-automatic.


Proof. (i) We may assume, from Proposition 10.2.6, that a ≥ 1. The d-kernel
K(d, (u_n)_n) of the sequence (u_n)_{n≥0} is finite. Let


K(d, (u_n)_n) = {(u_n^{(1)})_{n≥0}, (u_n^{(2)})_{n≥0}, ..., (u_n^{(t)})_{n≥0}},

with (u_n^{(1)})_{n≥0} = (u_n)_{n≥0}. Let us then define the set L of sequences by

L := {(u^{(i)}_{an+x})_{n≥0} ; i ∈ [1, t], x ∈ [0, a + b − 1]}.

The set L is clearly finite with at most t(a + b) elements. It thus suffices
to prove that the d-kernel of the sequence (u_{an+b})_{n≥0} is a subset of L. Let
(u_{a(d^k n + ℓ) + b})_{n≥0} be a sequence in the d-kernel of (u_{an+b})_{n≥0}, where k ≥ 0 and
ℓ ∈ [0, d^k − 1]. We write aℓ + b = x d^k + y, with y ∈ [0, d^k − 1]. Thus,

u_{a(d^k n + ℓ) + b} = u_{d^k (an + x) + y} = u^{(i)}_{an+x}

for some i that does not depend on n. Furthermore we have

x d^k ≤ x d^k + y = aℓ + b ≤ a(d^k − 1) + b < a d^k + b ≤ (a + b) d^k.

Hence x < a + b, and the sequence (u_{a(d^k n + ℓ) + b})_{n≥0} belongs to L.

(ii) Let us write, as above, the (finite) d-kernel of the sequence (u_n)_{n≥0} as


K(d, (u_n)_n) = {(u_n^{(1)})_{n≥0}, (u_n^{(2)})_{n≥0}, ..., (u_n^{(t)})_{n≥0}},


with (u_n^{(1)})_{n≥0} = (u_n)_{n≥0}. Let us then define t sequences (v_n^{(i)})_{n≥0}, i ∈ [1, t], by:
v_0^{(i)} = a and v_n^{(i)} = u_{n−1}^{(i)} for n ≥ 1. Note that (v_n^{(1)})_{n≥0} = (v_n)_{n≥0}. Consider
the (finite) set M defined by


M := {(u_n^{(1)})_{n≥0}, (u_n^{(2)})_{n≥0}, ..., (u_n^{(t)})_{n≥0}, (v_n^{(1)})_{n≥0}, (v_n^{(2)})_{n≥0}, ..., (v_n^{(t)})_{n≥0}}.


It suffices to prove that K(d, (v_n)_n) ⊂ M. Let (v_{d^k n + ℓ})_{n≥0} with k ≥ 0 and
ℓ ∈ [0, d^k − 1] be an element of K(d, (v_n)_n).


• If ℓ ≥ 1, then v_{d^k n + ℓ} = u_{d^k n + (ℓ−1)}. Since (ℓ − 1) ∈ [0, d^k − 1], the sequence
(u_{d^k n + (ℓ−1)})_{n≥0} is equal to (u_n^{(i)})_{n≥0} for some i ∈ [1, t], hence it belongs to M.

• If ℓ = 0, then (v_{d^k n + ℓ})_{n≥0} = (v_{d^k n})_{n≥0}. If n ≥ 1, let m = n − 1 ≥ 0. We
have:


v_{d^k n} = u_{d^k n − 1} = u_{d^k m + d^k − 1} = u_m^{(j)} = u_{n−1}^{(j)}

for some j that does not depend on n. Hence

v_{d^k n} = a if n = 0, and v_{d^k n} = u_{n−1}^{(j)} if n ≥ 1; that is, v_{d^k n} = v_n^{(j)}. □

REMARK 10.2.8. This proposition implies in particular, using (i), that shifting 
a d-automatic sequence gives a d-automatic sequence. Using this remark and 
(ii) shows that finite modifications of a d-automatic sequence give a d-automatic 
sequence. 
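
The finiteness argument of Proposition 10.2.7 can be explored numerically. The following is a minimal sketch, not part of the original text: it takes the Thue-Morse sequence as the 2-automatic sequence, and it identifies each kernel element only by its first 64 terms, which is a heuristic (two distinct sequences could in principle share a prefix). The helper names `thue_morse` and `kernel_signatures` are illustrative assumptions.

```python
def thue_morse(n: int) -> int:
    """n-th Thue-Morse term: parity of the number of 1 bits of n."""
    return bin(n).count("1") % 2


def kernel_signatures(u, d, scale=1, offset=0, n_terms=64):
    """Explore the d-kernel of v(n) = u(scale*n + offset) on truncated prefixes.

    Every kernel element is identified by its first n_terms values only,
    so the size of the result is a lower bound on the true kernel size.
    """
    def signature(s, o):
        return tuple(u(s * n + o) for n in range(n_terms))

    seen = {signature(scale, offset)}
    frontier = [(scale, offset)]
    while frontier:
        s, o = frontier.pop()
        for r in range(d):
            # From w(n) = u(s*n + o) to w(d*n + r) = u(s*d*n + s*r + o).
            child = (s * d, o + s * r)
            sig = signature(*child)
            if sig not in seen:
                seen.add(sig)
                frontier.append(child)
    return seen


tm_kernel = kernel_signatures(thue_morse, 2)
sub_kernel = kernel_signatures(thue_morse, 2, scale=3, offset=1)
print(len(tm_kernel), len(sub_kernel))
```

With t = 2 kernel elements for the Thue-Morse sequence, the proof of (i) bounds the 2-kernel of $(u_{3n+1})_{n\ge 0}$ by $t(a+b) = 2 \cdot (3+1) = 8$ sequences, and the second count stays within that bound.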


PROPOSITION 10.2.9. Let $d \ge 2$ be an integer. Let $(u_n)_{n\ge 0}$ be a sequence on
an alphabet A, such that there exists $a \in \mathbb{N} \setminus \{0\}$ for which all subsequences
$(u_{an+b})_{n\ge 0}$ are d-automatic, for $b \in [0, a-1]$. Then the sequence $(u_n)_{n\ge 0}$ is
d-automatic.


Proof. In order to prove that the d-kernel of the sequence $(u_n)_{n\ge 0}$ is finite, it
suffices to prove that the set of sequences of the form $(u_{d^k(an+b)+\ell})_{n\ge 0}$ for $k \ge 0$,
$\ell \in [0, d^k-1]$ and $b \in [0, a-1]$ is finite: namely interspersing these sequences
produces the sequences $(u_{d^k n+\ell})_{n\ge 0}$ for $k \ge 0$, $\ell \in [0, d^k-1]$.

Now, for $k \ge 0$, $\ell \in [0, d^k-1]$, and $b \in [0, a-1]$, let $d^k b + \ell = ar + s$ with
$s \in [0, a-1]$. This implies

$$ar \le ar + s = d^k b + \ell \le d^k b + d^k - 1 < d^k(b+1) \le d^k a,$$

hence $r \in [0, d^k-1]$. Then, for $n \ge 0$,

$$u_{d^k(an+b)+\ell} = u_{a(d^k n+r)+s}.$$

This shows that the sequence $(u_{d^k(an+b)+\ell})_{n\ge 0}$ belongs to the d-kernel of the
sequence $(u_{an+s})_{n\ge 0}$, hence to the (finite) set

$$\bigcup_{s \in [0, a-1]} K(d, (u_{an+s})_n).$$
□


COROLLARY 10.2.10. Let $(u_n)_{n\ge 0}$ be a sequence defined on the alphabet A.
Let $d \ge 2$ be an integer. Then the following properties are equivalent:

(i) the sequence $(u_n)_{n\ge 0}$ is d-automatic;

(ii) there exists an integer $a \ge 1$ such that the sequence $(u_n)_{n\ge 0}$ is $d^a$-automatic;

(iii) for every integer $a \ge 1$ the sequence $(u_n)_{n\ge 0}$ is $d^a$-automatic.

Proof. The implication (iii) ⟹ (ii) is trivial. Furthermore, we clearly have, for
any integer $a \ge 1$, the inclusion $K(d^a,(u_n)_n) \subset K(d,(u_n)_n)$, which shows that
(i) ⟹ (iii).

It remains to prove that (ii) ⟹ (i). Suppose that the sequence $(u_n)_{n\ge 0}$ is
$d^a$-automatic, for some $a \ge 1$. If $a = 1$ we are done. Hence we can suppose
that $a \ge 2$. Define $d' := d^{a-1}$. Fix $j \in [0, d'-1]$, and define the sequence
$(v_n)_{n\ge 0}$ by: $v_n := u_{d'n+j}$. This sequence $(v_n)_{n\ge 0}$ is d-automatic: namely for
each $i \in [0, d-1]$ we have $v_{dn+i} = u_{d'dn+d'i+j} = u_{d^a n+d'i+j}$, hence the sequence
$(v_{dn+i})_{n\ge 0}$ belongs to the finite set $K(d^a,(u_n)_n)$ (note that $d'i + j \le d^a - 1$).
Applying now Proposition 10.2.9 with $a = d'$ (the subsequences $(u_{d'n+j})_{n\ge 0}$,
$j \in [0, d'-1]$, being d-automatic) ends the proof. □


COROLLARY 10.2.11. Let $d \ge 2$ be an integer. Let $(u_n)_{n\ge 0}$ be a d-automatic
sequence defined on the alphabet A. Let B be an alphabet and let h be a
uniform morphism $h : A^* \to B^*$. Then the sequence $h((u_n)_{n\ge 0})$ is d-automatic.

Proof. We recall that the morphism h is extended by continuity to infinite
sequences. Let us suppose that the length of the morphism h is d'. Hence, for each
letter $e \in A$, the word h(e) can be written as $h(e) = a_{e,0}\, a_{e,1} \cdots a_{e,d'-1}$. We
define the maps $h_i : A \to B$, $i \in [0, d'-1]$, by: for each $e \in A$, $h_i(e) := a_{e,i}$.
Writing $(v_n)_{n\ge 0}$ for the sequence $h((u_n)_{n\ge 0})$, we thus can write it as

$$h_0(u_0)\, h_1(u_0) \cdots h_{d'-1}(u_0)\; h_0(u_1)\, h_1(u_1) \cdots h_{d'-1}(u_1) \cdots$$

In other words we have, for all $n \ge 0$ and for all $i \in [0, d'-1]$,

$$v_{d'n+i} = h_i(u_n).$$

But the sequences $(h_i(u_n))_{n\ge 0}$, for $i \in [0, d'-1]$, are d-automatic, from Proposition
10.2.5, hence the sequence $(v_n)_{n\ge 0}$ is d-automatic from Proposition 10.2.9. □


PROPOSITION 10.2.12. Let $d \ge 2$ be an integer. Let $(u_n)_{n\ge 0}$ and $(v_n)_{n\ge 0}$ be
two d-automatic sequences defined on the alphabet A.

(i) If A is a module over a commutative ring R, then the sequences
$((u+v)_n)_{n\ge 0} := (u_n + v_n)_{n\ge 0}$ and $((xu)_n)_{n\ge 0} := (x u_n)_{n\ge 0}$, where $x \in R$, are
d-automatic.

(ii) If A is a finite commutative ring, then the (ordinary) product of the
sequences $(u_n)_{n\ge 0}$ and $(v_n)_{n\ge 0}$, i.e., the sequence $((uv)_n)_{n\ge 0} := (u_n v_n)_{n\ge 0}$, is
d-automatic.

(iii) If A is a finite commutative ring, then the Cauchy product of the
sequences $(u_n)_{n\ge 0}$ and $(v_n)_{n\ge 0}$, i.e., the sequence $\big(\sum_{0 \le j \le n} u_j v_{n-j}\big)_{n\ge 0}$, is
d-automatic.
Proof. Assertions in (i) and (ii) are easy consequences of Propositions 10.2.5 and
10.2.6. Let us prove assertion (iii). Let $k \ge 0$ and $\ell \in [0, d^k-1]$. We first note
that, for $n \ge 1$, writing any $i \in [0, d^k n+\ell]$ as $i = d^k m + j$, with $j \in [0, d^k-1]$,
we have

$$d^k m \le d^k m + j = i \le d^k n + \ell \le d^k n + d^k - 1 < d^k(n+1).$$

This implies $m < n+1$, hence $m \le n$. Hence the inclusion

$$[0, d^k n+\ell] \subset \{d^k m + j,\ m \le n-1,\ 0 \le j \le d^k-1\} \cup \{d^k n + j,\ j \in [0,\ell]\}.$$

The reverse inclusion is clear, hence

$$[0, d^k n+\ell] = \{d^k m + j,\ m \le n-1,\ 0 \le j \le d^k-1\} \cup \{d^k n + j,\ j \in [0,\ell]\}.$$

This equality clearly implies

$$[0, d^k n+\ell] = \{d^k m + j,\ m \le n-1,\ \ell < j \le d^k-1\} \cup \{d^k m + j,\ m \le n,\ j \in [0,\ell]\}.$$

Now let us consider our two d-automatic sequences and let us take an element
in the d-kernel of the sequence $\big(\sum_{0 \le j \le n} u_j v_{n-j}\big)_{n\ge 0}$, i.e., let $k \ge 0$ and
$\ell \in [0, d^k-1]$, then

$$\sum_{0 \le i \le d^k n+\ell} u_i\, v_{d^k n+\ell-i} = S_1(n) + S_2(n)$$

where

$$S_1(n) = \sum_{\ell < j \le d^k-1} \Big( \sum_{0 \le m \le n-1} u_{d^k m+j}\, v_{d^k n+\ell-d^k m-j} \Big)$$

and

$$S_2(n) = \sum_{0 \le j \le \ell} \Big( \sum_{0 \le m \le n} u_{d^k m+j}\, v_{d^k n+\ell-d^k m-j} \Big).$$

Writing $S_1(n)$, for $n \ge 1$, as

$$S_1(n) = \sum_{\ell < j \le d^k-1} \Big( \sum_{0 \le m \le n-1} u_{d^k m+j}\, v_{d^k(n-1-m)+d^k+\ell-j} \Big)$$

we see that, for $n \ge 1$, $S_1(n)$ is a finite sum of sequences of the type

$$\sum_{0 \le m \le n-1} u^{(i)}_m\, v^{(i')}_{n-1-m}$$

where the sequence $(u^{(i)}_n)_{n\ge 0}$ (resp. $(v^{(i')}_n)_{n\ge 0}$) belongs to the d-kernel of the
sequence $(u_n)_{n\ge 0}$ (resp. to the d-kernel of the sequence $(v_n)_{n\ge 0}$).

We also see that $S_2(n)$ is a finite sum of sequences of the type

$$\sum_{0 \le m \le n} u^{(i)}_m\, v^{(i')}_{n-m}$$

where the sequence $(u^{(i)}_n)_{n\ge 0}$ (resp. $(v^{(i')}_n)_{n\ge 0}$) belongs to the d-kernel of the
sequence $(u_n)_{n\ge 0}$ (resp. to the d-kernel of the sequence $(v_n)_{n\ge 0}$).
Hence the sequence $(S_1(n) + S_2(n))_{n\ge 1}$ belongs to a finite set of sequences.
Since $S_1(0) + S_2(0)$ can take only finitely many values, we are done. □


10.2.5. A density property for “automatic” sets of integers 


This section is devoted to proving a density property of sets of integers defined 
by automatic sequences. Before stating it we need two definitions and a lemma. 
A subset M of the integers is said to have a density if the limit

$$\lim_{x\to\infty} \frac{1}{x}\, \operatorname{Card}\{n \le x,\ n \in M\}$$

exists. The value of this limit is called the density of the set M.

A factor w of an infinite word x is said to have a density if the set of indices
of occurrence of this factor in x admits a density, that is, if the limit of the
number of occurrences of this factor in the first k terms of the word divided by
k exists. The value of this limit, that we denote by $\pi(w)$, is called the probability
(or the frequency) of occurrence of the factor w in x.


The following lemma is a direct consequence of the Perron-Frobenius theorem
(for more details, see Section 1.7.2).


LEMMA 10.2.13. Let M be a positive stochastic matrix, i.e., such that all its
entries are nonnegative and all the entries in any column sum up to 1. Then the
sequence of matrices $M^n$ converges and all the entries in the limit are rational
numbers.


Let h be a morphism on the alphabet $C := \{c_1, c_2, \ldots, c_t\}$. The incidence
matrix (also called the transition matrix or substitution matrix) of h is the
$t \times t$ matrix $M = (M_{ij})_{i,j}$ defined by

$$M_{ij} = |h(c_j)|_{c_i} := \text{number of occurrences of } c_i \text{ in } h(c_j).$$


REMARK 10.2.14. The incidence matrix is the transpose of the matrix introduced
in Section 1.8.6. If the incidence matrix of h is M, it is easy to see that
the incidence matrix for $h^n$ is $M^n$. If h is a d-morphism, it is clear that the
entries in any column of its incidence matrix M sum up to d. We introduce
this matrix also in this chapter in order to deduce probabilities of occurrence of
letters.
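
The two facts stated in Remark 10.2.14 can be checked on a small case. The sketch below is illustrative only: the 3-morphism `h` on the alphabet `abc` is an arbitrary example chosen here, not one taken from the text, and the helper names are assumptions.

```python
def incidence_matrix(h, alphabet):
    """M[i][j] = number of occurrences of alphabet[i] in h(alphabet[j])."""
    return [[h[cj].count(ci) for cj in alphabet] for ci in alphabet]


def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def compose(h, g):
    """The morphism c -> h(g(c))."""
    return {c: "".join(h[x] for x in g[c]) for c in g}


alphabet = "abc"
h = {"a": "abc", "b": "aab", "c": "cca"}  # hypothetical 3-morphism

M = incidence_matrix(h, alphabet)
M2 = incidence_matrix(compose(h, h), alphabet)   # incidence matrix of h^2
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
print(M, M2 == mat_mul(M, M), column_sums)
```

Dividing M by d = 3 gives a matrix whose columns sum to 1, which is the form required by Lemma 10.2.13.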
PROPOSITION 10.2.15. Let $d \ge 2$ be an integer. Let $(u_n)_{n\ge 0}$ be a d-automatic
sequence on the set A. Let a belong to A. If the set $\{n \ge 0,\ u_n = a\}$ has a
density, this density (which is the probability $\pi(a)$ of occurrence of the letter a)
must be a rational number.


Proof. Since the sequence $(u_n)_{n\ge 0}$ is d-automatic, there exist an alphabet C,
a sequence $(v_n)_{n\ge 0}$ on C, a d-morphism $h : C^* \to C^*$, and a map $\varphi : C \to A$
such that: the sequence $(v_n)_{n\ge 0}$ is a fixed point of the d-morphism h, and for
all $n \ge 0$, one has $u_n = \varphi(v_n)$.

We first note that, for each letter $c \in C$ the limit

$$\lim_{n\to\infty} \frac{1}{d^n}\, \operatorname{Card}\{m \le d^n - 1,\ v_m = c\}$$

exists. Namely, let $M = (M_{ij})_{i,j}$ be the incidence matrix of the d-morphism h,
and let $M^n = (M^{(n)}_{ij})_{i,j}$. Then

$$\frac{1}{d^n}\, \operatorname{Card}\{m \le d^n - 1,\ v_m = c\} = \frac{1}{d^n}\, |h^n(v_0)|_c = \frac{M^{(n)}_{ij}}{d^n} \quad \text{for some } i, j.$$

Since the matrix M/d is clearly positive and stochastic, Lemma 10.2.13
shows that $\lim_{n\to\infty} M^{(n)}_{ij}/d^n$ exists and is rational. Hence, if $C' = \varphi^{-1}(a)$ is the
subset of C consisting of the elements of C whose image by $\varphi$ is equal to a, the
limit

$$\lim_{n\to\infty} \frac{1}{d^n}\, \operatorname{Card}\{m \le d^n - 1,\ u_m = a\}$$

is the sum over C' of rational numbers, hence a rational number itself. Since
the density of the set $\{m,\ u_m = a\}$ exists, it must be equal to the previous
quantity, hence rational. □


10.3. Christol's algebraic characterization of automatic sequences
10.3.1. Formal power series


We recall that the ring K[[X]] of formal power series with coefficients in a field
K is defined by

$$K[[X]] := \Big\{ \sum_{n\ge 0} u_n X^n,\ u_n \in K \Big\},$$

where addition and multiplication of the series $F := \sum_{n\ge 0} u_n X^n$ and
$G := \sum_{n\ge 0} v_n X^n$ are defined by

$$F + G := \sum_{n\ge 0} (u_n + v_n) X^n, \qquad FG := \sum_{n\ge 0} \Big( \sum_{i+j=n} u_i v_j \Big) X^n.$$

The ring K[[X]] is a subring of the field K((X)) of formal Laurent series

$$K((X)) := \Big\{ \sum_{n\ge -n_0} u_n X^n,\ n_0 \in \mathbb{Z},\ u_n \in K \Big\},$$

where addition and multiplication are defined analogously.


Note that the field of rational functions K(X) is a subfield of K((X)). Hence
we can define algebraicity over K(X) for an element belonging to K((X)).

The formal power series $F = F(X) = \sum_{n\ge -n_0} u_n X^n$ is said to be algebraic
(over the field K(X)), if there exist an integer $d \ge 1$ and polynomials $A_0(X)$,
$A_1(X), \ldots, A_d(X)$, with coefficients in K and not all zero, such that

$$A_0 + A_1 F + A_2 F^2 + \cdots + A_d F^d = 0.$$


REMARK 10.3.1.
• Any element of K(X) is algebraic over K(X).
• The sum and product of algebraic elements are algebraic.
• Let $F = \sum_{n\ge -n_0} u_n X^n$ be an algebraic power series. Its derivative
$F' := \sum_{n\ge -n_0} n u_n X^{n-1}$ is also algebraic. Namely take an equation as above
with minimal degree d:

$$A_0 + A_1 F + A_2 F^2 + \cdots + A_d F^d = 0.$$

Taking the derivative gives

$$A_0' + A_1' F + A_2' F^2 + \cdots + A_d' F^d = -F'\,(A_1 + 2A_2 F + \cdots + d A_d F^{d-1}).$$

The coefficient of F' cannot be zero (d is minimal and the $A_i$'s are not
all zero). Hence F' is the quotient of two elements that are algebraic over
K(X), thus it is algebraic over K(X).


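10.3.2. A simple example
Let $F(X) := \sum_{n\ge 0} u_n X^n$ where $(u_n)_{n\ge 0}$ is the Thue-Morse sequence. We have

$$F(X) = \sum_{n\ge 0} u_{2n} X^{2n} + \sum_{n\ge 0} u_{2n+1} X^{2n+1} = \sum_{n\ge 0} u_n X^{2n} + X \sum_{n\ge 0} (u_n + 1) X^{2n}$$

$$= F(X^2) + X F(X^2) + \frac{X}{1 - X^2}.$$

Hence we have, over the two-element field $\mathbb{F}_2$ (where $F(X^2) = F(X)^2$ and $1 - X^2 = (1+X)^2$),

$$(1 + X)^3 F(X)^2 + (1 + X)^2 F(X) + X = 0.$$

In other words the series F(X) is algebraic (actually quadratic) over the field
$\mathbb{F}_2(X)$.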
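
The quadratic relation can be tested on truncated series. The sketch below is an illustration under stated assumptions: series are represented as coefficient lists over $\mathbb{F}_2$ truncated modulo $X^N$ with N = 256 (an arbitrary choice), and the helper names `mul` and `add` are hypothetical. Since the identity holds for the full series, every coefficient of the truncated left-hand side should vanish.

```python
N = 256  # truncation order: all series are taken modulo X**N


def mul(a, b):
    """Product over F_2 of two coefficient lists, truncated modulo X**N."""
    out = [0] * N
    for i, ai in enumerate(a[:N]):
        if ai:
            for j, bj in enumerate(b[: N - i]):
                out[i + j] ^= bj
    return out


def add(*series):
    """Sum over F_2 of coefficient lists, padded to length N."""
    out = [0] * N
    for s in series:
        for i, c in enumerate(s[:N]):
            out[i] ^= c
    return out


F = [bin(n).count("1") % 2 for n in range(N)]  # Thue-Morse coefficients
one_plus_x = [1, 1]
square = mul(one_plus_x, one_plus_x)   # (1 + X)^2
cube = mul(square, one_plus_x)         # (1 + X)^3
x = [0, 1]

# Left-hand side (1+X)^3 F^2 + (1+X)^2 F + X, modulo X**N.
lhs = add(mul(cube, mul(F, F)), mul(square, F), x)
print(any(lhs))
```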
10.3.3. Christol's theorem


The example given in Section 10.3.2 above is actually a particular case of a
general property of algebraic formal power series over $\mathbb{F}_q(X)$, where $\mathbb{F}_q$ is a finite field, which
is a characterization of these series. We begin with a definition and a lemma.

Let $q = p^a$ be a positive power of a prime integer p. Let $\mathbb{F}_q$ be the finite
field of cardinality q (the characteristic of $\mathbb{F}_q$ is p). For $0 \le r < q$, we define the
linear map $\Lambda_r$ on $\mathbb{F}_q[[X]]$ by

$$\text{if } F = F(X) = \sum_{i\ge 0} u_i X^i, \text{ then } \Lambda_r(F) := \sum_{i\ge 0} u_{qi+r} X^i.$$


LEMMA 10.3.2. Let A = A(X) and B = B(X) be two formal power series in
$\mathbb{F}_q[[X]]$. Then

$$A = \sum_{0 \le r < q} X^r\, (\Lambda_r(A))^q, \qquad \text{and} \qquad \Lambda_r(A^q B) = A\, \Lambda_r(B).$$


Proof. The proof is left to the reader who might want to remember that we
have in $\mathbb{F}_q[[X]]$ the equality

$$\Big( \sum_{n\ge 0} u_n X^n \Big)^q = \sum_{n\ge 0} u_n X^{qn}.$$
□
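
Both identities of Lemma 10.3.2 can be checked on polynomials, which are power series with finitely many nonzero coefficients. The sketch below makes a simplifying assumption: it takes q = p = 3, so that $\mathbb{F}_q$ is just the integers modulo 3 (a general $q = p^a$ would need genuine finite-field arithmetic). The polynomials `A` and `B` are arbitrary test data, and `cartier` is an illustrative name for $\Lambda_r$.

```python
P = 3  # assumption for this sketch: q = p = 3, so F_q is Z/3Z


def poly_add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [(x + y) % P for x, y in zip(a, b)]


def poly_mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out


def poly_pow(a, e):
    result = [1]
    for _ in range(e):
        result = poly_mul(result, a)
    return result


def cartier(a, r):
    """Lambda_r: sum u_i X^i -> sum u_{q*i + r} X^i (here q = P)."""
    return a[r::P]


def trim(a):
    """Drop trailing zero coefficients so that equal polynomials compare equal."""
    a = list(a)
    while a and a[-1] == 0:
        a.pop()
    return a


A = [1, 2, 0, 1, 2, 2, 1]
B = [2, 0, 1, 1, 0, 2, 1, 1]

# First identity: A = sum over r of X^r * Lambda_r(A)^q.
rebuilt = []
for r in range(P):
    rebuilt = poly_add(rebuilt, [0] * r + poly_pow(cartier(A, r), P))

# Second identity: Lambda_r(A^q B) = A * Lambda_r(B) for every r.
AqB = poly_mul(poly_pow(A, P), B)
second_ok = all(
    trim(cartier(AqB, r)) == trim(poly_mul(A, cartier(B, r))) for r in range(P)
)
print(trim(rebuilt) == trim(A), second_ok)
```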


We will also need a proposition proving that, in positive characteristic, any
algebraic formal power series satisfies a "special" algebraic equation.


PROPOSITION 10.3.3. Let p be a prime number, let $a \ge 1$ be an integer, and
$q := p^a$. Let F(X) be a formal power series with coefficients in $\mathbb{F}_q$. Then F is
algebraic over $\mathbb{F}_q(X)$ if and only if there exist polynomials $B_0(X), \ldots, B_t(X)$ in
$\mathbb{F}_q[X]$ not all equal to zero, such that

$$B_0 F + B_1 F^q + B_2 F^{q^2} + \cdots + B_t F^{q^t} = 0.$$

Furthermore we can suppose that $B_0 \ne 0$.


Proof. If the formal power series F(X) satisfies

$$B_0 F + B_1 F^q + B_2 F^{q^2} + \cdots + B_t F^{q^t} = 0,$$

where the polynomials $B_i(X)$ are not all equal to zero, then F is clearly algebraic
over $\mathbb{F}_q(X)$. Now, if F is algebraic, the series $F, F^q, F^{q^2}, \ldots$ cannot be all
linearly independent. Hence there exists a nontrivial linear relation

$$B_0 F + B_1 F^q + B_2 F^{q^2} + \cdots + B_t F^{q^t} = 0.$$

Let us prove that there exists such a relation with $B_0 \ne 0$. Suppose that

$$B_0 F + B_1 F^q + B_2 F^{q^2} + \cdots + B_t F^{q^t} = 0$$

with t minimal, and let j be the smallest non-negative integer such that $B_j \ne 0$.
We will prove that j = 0. Since

$$B_j = \sum_{0 \le r < q} X^r\, (\Lambda_r(B_j))^q$$

by Lemma 10.3.2, it follows that there exists r with $\Lambda_r(B_j) \ne 0$. Now, since
$\sum_{j \le i \le t} B_i F^{q^i} = 0$, we have

$$\sum_{j \le i \le t} \Lambda_r(B_i F^{q^i}) = 0$$

and, using Lemma 10.3.2, we see that, if $j \ne 0$, then

$$\sum_{j \le i \le t} \Lambda_r(B_i)\, F^{q^{i-1}} = 0,$$

which gives a new relation with the coefficient of $F^{q^{j-1}}$ nonzero, a contradiction,
hence j = 0. We thus have the relation

$$\sum_{0 \le i \le t} B_i F^{q^i} = 0,$$

with $B_0 \ne 0$. □

We now state Christol’s theorem. 


THEOREM 10.3.4. Let A be a non-empty alphabet, and let $(u_n)_{n\ge 0}$ be a
sequence of elements of A. Let p be a prime number. Then the sequence $(u_n)_{n\ge 0}$
is p-automatic if and only if there exist an integer $a \ge 1$ and an injective map
$\iota : A \to \mathbb{F}_{p^a}$ such that the formal power series $\sum_{n\ge 0} \iota(u_n) X^n$ is algebraic over
$\mathbb{F}_{p^a}(X)$.


Proof. Let us first suppose that the sequence $(u_n)_{n\ge 0}$ is p-automatic. Choose
a such that $|A| \le p^a$, and choose an injective map $\iota : A \to \mathbb{F}_{p^a}$. Up to
notation we may suppose that $A \subset \mathbb{F}_{p^a}$ and that $\iota$ is the identity map. We
thus want to prove that the formal power series $\sum_{n\ge 0} u_n X^n$ is algebraic over
$\mathbb{F}_{p^a}(X)$. Since the sequence $(u_n)_{n\ge 0}$ is p-automatic, it is also $p^a$-automatic from
Corollary 10.2.10. Hence $K(p^a,(u_n)_n)$ is finite, say

$$K(p^a,(u_n)_n) = \{(u^{(1)}_n)_{n\ge 0}, (u^{(2)}_n)_{n\ge 0}, \ldots, (u^{(t)}_n)_{n\ge 0}\}$$

with $(u^{(1)}_n)_{n\ge 0} = (u_n)_{n\ge 0}$. Let us define

$$F_j(X) = \sum_{n\ge 0} u^{(j)}_n X^n \quad \text{for } j \text{ in } [1,t].$$

Then, for j such that $1 \le j \le t$, we have

$$F_j(X) = \sum_{0 \le r \le p^a-1}\ \sum_{m\ge 0} u^{(j)}_{p^a m+r} X^{p^a m+r} = \sum_{0 \le r \le p^a-1} X^r \sum_{m\ge 0} u^{(j)}_{p^a m+r} X^{p^a m}.$$

But the sequence $(u^{(j)}_{p^a m+r})_{m\ge 0}$ is one of the sequences $(u^{(i)}_m)_{m\ge 0}$, hence
$F_j(X)$ is a linear combination, with coefficients in the field $\mathbb{F}_{p^a}(X)$, of the
power series $F_i(X^{p^a})$. In other words, for all $j \in [1,t]$, the formal power series
$F_j(X)$ belongs to the $\mathbb{F}_{p^a}(X)$-vector space generated by the t series $F_i(X^{p^a})$,
$i \in [1,t]$:

$$F_j(X) \in \langle F_1(X^{p^a}), F_2(X^{p^a}), \ldots, F_t(X^{p^a}) \rangle.$$

This implies that for all $j \in [1,t]$

$$F_j(X^{p^a}) \in \langle F_1(X^{p^{2a}}), F_2(X^{p^{2a}}), \ldots, F_t(X^{p^{2a}}) \rangle,$$

and thus that for all $j \in [1,t]$

$$F_j(X) \in \langle F_1(X^{p^{2a}}), F_2(X^{p^{2a}}), \ldots, F_t(X^{p^{2a}}) \rangle.$$

Hence, for all $j \in [1,t]$,

$$F_j(X) \text{ and } F_j(X^{p^a}) \in \langle F_1(X^{p^{2a}}), F_2(X^{p^{2a}}), \ldots, F_t(X^{p^{2a}}) \rangle.$$

This implies that, for all $j \in [1,t]$,

$$F_j(X^{p^a}) \text{ and } F_j(X^{p^{2a}}) \in \langle F_1(X^{p^{3a}}), F_2(X^{p^{3a}}), \ldots, F_t(X^{p^{3a}}) \rangle.$$

Hence, for all $j \in [1,t]$,

$$F_j(X),\ F_j(X^{p^a}) \text{ and } F_j(X^{p^{2a}}) \in \langle F_1(X^{p^{3a}}), F_2(X^{p^{3a}}), \ldots, F_t(X^{p^{3a}}) \rangle.$$

Iterating, we have, for all $j \in [1,t]$ and for all $k \in [0,t]$,

$$F_j(X^{p^{ka}}) \in \langle F_1(X^{p^{(t+1)a}}), F_2(X^{p^{(t+1)a}}), \ldots, F_t(X^{p^{(t+1)a}}) \rangle.$$

But the dimension of a finitely generated vector space is at most the number of
its generators. Hence the dimension of the $\mathbb{F}_{p^a}(X)$-vector space

$$\langle F_1(X^{p^{(t+1)a}}), F_2(X^{p^{(t+1)a}}), \ldots, F_t(X^{p^{(t+1)a}}) \rangle$$

is at most t. Hence for any $j \in [1,t]$, there must exist a nontrivial linear relation
between the formal power series

$$F_j(X),\ F_j(X^{p^a}),\ \ldots,\ F_j(X^{p^{ta}})$$

over $\mathbb{F}_{p^a}(X)$. Taking j = 1, and remembering that $F_1(X^{p^{ka}}) = F_1(X)^{p^{ka}}$ (the
ground field is $\mathbb{F}_{p^a}$) this gives that $F(X) = F_1(X) = \sum_{n\ge 0} u^{(1)}_n X^n$ is algebraic
over $\mathbb{F}_{p^a}(X)$.
Let us now suppose that there exist an integer $a \ge 1$ and an injective map
$\iota : A \to \mathbb{F}_{p^a}$ such that the formal power series $\sum_{n\ge 0} \iota(u_n) X^n$ is algebraic
over $\mathbb{F}_{p^a}(X)$. The sequence $(u_n)_{n\ge 0}$ is p-automatic if and only if the sequence
$(\iota(u_n))_{n\ge 0}$ is p-automatic. Up to renaming we can suppose that $A \subset \mathbb{F}_{p^a}$ and
that the formal power series $F := \sum_{n\ge 0} u_n X^n$ is algebraic over $\mathbb{F}_{p^a}(X)$. Then,
writing $q := p^a$, from Proposition 10.3.3, there exist polynomials $B_0(X), \ldots, B_t(X)$ with $B_0 \ne 0$
such that

$$\sum_{0 \le i \le t} B_i F^{q^i} = 0.$$

Define $G = G(X) = F(X)/B_0(X)$. Then

$$G(X) = \sum_{1 \le i \le t} C_i(X)\, G(X)^{q^i} \quad \text{where } C_i(X) := -B_i(X)\, B_0^{q^i-2}(X).$$

Now let $N = \max(\deg B_0, \max_i \{\deg C_i\})$, and define $\mathcal{H}$ by

$$\mathcal{H} := \Big\{ H \in \mathbb{F}_{p^a}[[X]],\ H = \sum_{0 \le i \le t} D_i G^{q^i} \text{ with } D_i \in \mathbb{F}_{p^a}[X] \text{ and } \deg D_i \le N \Big\}.$$

It is clear that $\mathcal{H}$ is a finite set and that $F = B_0 G$ belongs to $\mathcal{H}$. We now prove
that $\mathcal{H}$ is mapped into itself by $\Lambda_r$. Let $H \in \mathcal{H}$. Then

$$\Lambda_r(H) = \Lambda_r\Big( D_0 G + \sum_{1 \le i \le t} D_i G^{q^i} \Big) = \Lambda_r\Big( \sum_{1 \le i \le t} (D_0 C_i + D_i)\, G^{q^i} \Big) = \sum_{1 \le i \le t} \Lambda_r(D_0 C_i + D_i)\, G^{q^{i-1}}.$$

Since $\deg D_0, \deg D_i, \deg C_i \le N$, we have $\deg(D_0 C_i + D_i) \le 2N$, and hence

$$\deg(\Lambda_r(D_0 C_i + D_i)) \le \frac{2N}{p^a} \le N.$$

Hence $\mathcal{H}$ is a finite set that contains F and that is stable under the maps
$\Lambda_r$ for $r \in [0, p^a-1]$. This clearly implies that the $p^a$-kernel of the sequence
$(u_n)_{n\ge 0}$ is finite. The sequence $(u_n)_{n\ge 0}$ is thus $p^a$-automatic, hence p-automatic
(Corollary 10.2.10). □
10.4. An application to transcendence in positive characteristic


The Christol theorem is a combinatorial criterion that can be used as a tool to
prove the transcendence of formal power series over a finite field. We give here
an automata-based proof of transcendence for the Carlitz formal power series
$\Pi_q$.

Let p be a prime number. Let a be an integer $\ge 1$ and let $q := p^a$. The
Carlitz formal power series $\Pi_q$ is defined by

$$\Pi_q := \prod_{k\ge 1} \Big( 1 - \frac{X^{q^k} - X}{X^{q^{k+1}} - X} \Big).$$

REMARK 10.4.1. Note that $\Pi_q$ belongs to $\mathbb{F}_q((X^{-1}))$.


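THEOREM 10.4.2. The formal power series $\Pi_q$ is transcendental over the field
$\mathbb{F}_q(X)$.


Proof. We first compute $\Pi_q'/\Pi_q$, where $\Pi_q'$ is the derivative of $\Pi_q$ (with respect
to X). It is easy to obtain:

$$\frac{\Pi_q'}{\Pi_q} = \sum_{k\ge 1} \frac{1}{X^{q^k} - X} - \frac{1}{X^q - X}.$$

If $\Pi_q$ were algebraic over $\mathbb{F}_q(X)$, then $\Pi_q'$ would also be algebraic in view of
Remark 10.3.1. Hence $\Pi_q'/\Pi_q$ would be algebraic. Since $1/(X^q - X)$ is rational,
this would imply that $\sum_{k\ge 1} \frac{1}{X^{q^k} - X}$ would be algebraic over $\mathbb{F}_q(X)$. We then
write

$$\sum_{k\ge 1} \frac{1}{X^{q^k} - X} = \frac{1}{X} \sum_{k\ge 1} \Big(\frac{1}{X}\Big)^{q^k-1} \sum_{n\ge 0} \Big(\frac{1}{X}\Big)^{n(q^k-1)} = \frac{1}{X} \sum_{k\ge 1} \sum_{n\ge 0} \Big(\frac{1}{X}\Big)^{(n+1)(q^k-1)}$$

$$= \frac{1}{X} \sum_{k\ge 1} \sum_{n\ge 1} \Big(\frac{1}{X}\Big)^{n(q^k-1)} = \frac{1}{X} \sum_{m\ge 1} c(m) \Big(\frac{1}{X}\Big)^m$$

where

$$c(m) := \sum_{\substack{k, n \ge 1 \\ n(q^k-1) = m}} 1 = \sum_{\substack{k \ge 1 \\ q^k-1 \,\mid\, m}} 1.$$

We then note that $\mathbb{F}_q(X) = \mathbb{F}_q(X^{-1})$. Hence, replacing X by $X^{-1}$ in Christol's
theorem, we see that the algebraicity of $\Pi_q$ would imply the q-automaticity of
the sequence $(c(m))_{m\ge 1}$.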
Now, if the sequence $(c(m))_{m\ge 1}$ were q-automatic, then the subsequence
$(c(q^n - 1))_{n\ge 0}$ would be ultimately periodic by Proposition 10.2.4. But

$$c(q^n - 1) = \sum_{\substack{k \ge 1 \\ q^k-1 \,\mid\, q^n-1}} 1 = \sum_{\substack{k \ge 1 \\ k \,\mid\, n}} 1 = d(n)$$

by Problem 10.4.1, where d(n) is the number of positive integral divisors of n.
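
The identity $c(q^n - 1) = d(n)$ is easy to test by direct counting. The following sketch is illustrative only; the function names `c` and `num_divisors` are assumptions, and the values of q and n tried are arbitrary small cases.

```python
def c(m: int, q: int) -> int:
    """Number of integers k >= 1 such that q**k - 1 divides m (m >= 1)."""
    count, k = 0, 1
    while q**k - 1 <= m:
        if m % (q**k - 1) == 0:
            count += 1
        k += 1
    return count


def num_divisors(n: int) -> int:
    """Number of positive divisors of n, by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


checks = [
    (q, n, c(q**n - 1, q), num_divisors(n))
    for q in (2, 3, 4, 9)
    for n in range(1, 13)
]
print(all(left == right for _, _, left, right in checks))
```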

Since $q = p^k$ for some $k \ge 1$, where p is a prime, we would have that
$(d(n) \bmod p)_{n\ge 1}$ is ultimately periodic. Hence there would exist integers $t \ge 1$,
$n_0 \ge 0$ such that, for all $n \ge n_0$ and $k \ge 1$,

$$d(n + kt) \equiv d(n) \pmod p.$$

Take $k = nk'$. Then

$$d(n(1 + k't)) \equiv d(n) \pmod p$$

for all $k' \ge 1$. Now by Dirichlet's theorem we can find $k' \ge 1$ such that
$p' = 1 + k't$ is a prime with $p' \ge n_0$. Take $n = p'$. We get

$$d(p'^2) \equiv d(p') \pmod p$$

and hence $3 \equiv 2 \pmod p$, hence the desired contradiction. □


10.5. An application to transcendental power series over the rationals


We recall the following definition.
A word w on the alphabet A is called primitive if it cannot be written as
$w = v^a$ for some word v in $A^*$ and some integer $a \ge 2$.


PROPOSITION 10.5.1. Let $\psi_k(n)$ be the number of primitive words of length
n over the alphabet A with Card A = k $\ge$ 2. Then the formal power series
$R(X) = \sum_{n\ge 1} \psi_k(n) X^n$ is transcendental over $\mathbb{Q}(X)$.


Proof. We recall that $\psi_k(n) = \sum_{d \mid n} \mu(d)\, k^{n/d}$, where $\mu$ is the Möbius function
(see Problem 10.5.1). If the series $R(X) = \sum_{n\ge 1} \psi_k(n) X^n$ were algebraic over
$\mathbb{Q}(X)$, then the series $\tilde{R}(X) := \sum_{n\ge 1} \frac{\psi_k(n)}{k} X^n$ would also be algebraic over
$\mathbb{Q}(X)$. Thus (note that the number $\psi_k(n)/k$ is an integer for every $n \ge 1$) for any
prime number p the series

$$R_p(X) = \sum_{n\ge 1} \Big( \frac{\psi_k(n)}{k} \bmod p \Big) X^n$$

would be algebraic over the field $\mathbb{F}_p(X)$ from Problem 10.5.2. Take any prime
number p dividing k (recall that $k \ge 2$). We see that $\psi_k(n)/k \equiv \mu(n) \bmod p$.
Hence the series

$$\sum_{n\ge 1} (\mu(n) \bmod p)\, X^n$$

would be algebraic over $\mathbb{F}_p(X)$. It follows, using Theorem 10.3.4, that the
sequence $(\mu(n) \bmod p)_{n\ge 0}$ would be p-automatic. From Proposition 10.2.15
this implies that, if the set

$$\{n \ge 1,\ \mu(n) \equiv 0 \bmod p\} = \{n \ge 1,\ \mu(n) = 0\}$$

has a density, this density would be rational. But this set has a density equal
to $1 - 6/\pi^2$ (see Problem 10.5.1), which gives the desired contradiction. □
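
The Möbius formula for $\psi_k(n)$ and the congruence $\psi_k(n)/k \equiv \mu(n) \bmod p$ used in the proof can be compared with a brute-force count on small cases. This is a sketch for illustration; the function names are assumptions, and the brute-force enumeration is only feasible for small k and n.

```python
from itertools import product


def mobius(n: int) -> int:
    """Möbius function, by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # a squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result


def psi(k: int, n: int) -> int:
    """Number of primitive words of length n over k letters (Möbius formula)."""
    return sum(mobius(d) * k ** (n // d) for d in range(1, n + 1) if n % d == 0)


def is_primitive(w: tuple) -> bool:
    """True if w is not a proper power v**a with a >= 2."""
    n = len(w)
    return not any(
        n % p == 0 and w[:p] * (n // p) == w for p in range(1, n)
    )


def psi_brute(k: int, n: int) -> int:
    return sum(1 for w in product(range(k), repeat=n) if is_primitive(w))


formula_ok = all(psi(2, n) == psi_brute(2, n) for n in range(1, 11))
# Congruence psi_k(n)/k = mu(n) mod p, here with k = 6 and p = 3.
congruence_ok = all((psi(6, n) // 6 - mobius(n)) % 3 == 0 for n in range(1, 30))
print(formula_ok, congruence_ok)
```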


REMARK 10.5.2. Proposition 10.5.1 and the Chomsky-Schützenberger theorem
imply the following result: if the language of primitive words over an
alphabet of size $\ge 2$ is context-free, it must be inherently ambiguous.


10.6. An application to transcendence of real numbers 


We will prove in this section a theorem of transcendence (over the rationals) of 
real numbers whose base b-expansion is the fixed point of a morphism satisfying 
some extra hypotheses. This theorem is a consequence of a combinatorial version 
of a theorem of Ridout. We first give Ridout’s theorem without proof. 


THEOREM 10.6.1. Let $\xi \ne 0$ be a real algebraic number. Let $\rho, c_1, c_2, c_3$
be positive constants, and let $\lambda$ and $\mu$ satisfy $0 \le \lambda, \mu \le 1$. Let $r', r'' \ge 0$
be integers, and suppose $\omega_1, \omega_2, \ldots, \omega_{r'+r''}$ are finitely many distinct primes.
Assume there exist infinitely many fractions $p_n/q_n$ such that

$$\Big| \frac{p_n}{q_n} - \xi \Big| \le c_1 |q_n|^{-\rho}.$$

Furthermore, suppose that $p_n$ and $q_n$ are not zero and can be written in the
form

$$p_n = p_n' \prod_{j=1}^{r'} \omega_j^{e_j}, \qquad q_n = q_n' \prod_{j=r'+1}^{r'+r''} \omega_j^{e_j},$$

where the $e_j$ are non-negative integers that may depend on n, and the $(p_n')$'s
and $(q_n')$'s are positive integers that may depend on n. Finally, suppose that

$$0 < |p_n'| \le c_2 |p_n|^{\lambda}, \qquad 0 < |q_n'| \le c_3 |q_n|^{\mu},$$

for all $n \ge 0$. Then

$$\rho \le \lambda + \mu.$$
COROLLARY 10.6.2. Let $\xi$ be an irrational number. Suppose that, for every
integer $n \ge 0$, the base-k expansion of $\xi$ begins by $0.U_n V_n V_n V_n'$, where $U_n$
belongs to $\{0, 1, \ldots, k-1\}^*$, $V_n$ belongs to $\{0, 1, \ldots, k-1\}^+$, and the word $V_n'$
is a prefix of $V_n$. Furthermore suppose that $\lim_{n\to\infty} |V_n| = \infty$, and that there
exist real numbers $0 \le \alpha < \infty$ and $\beta > 0$ such that for all $n \ge 0$ we have
$|U_n| \le \alpha |V_n|$ and $|V_n'| \ge \beta |V_n|$. Then $\xi$ is a transcendental number.


Proof. Let $r_n = |U_n|$, $s_n = |V_n|$, and $s_n' = |V_n'|$, so, for all $n \ge 0$, we have
$r_n \le \alpha s_n$ and $s_n' \ge \beta s_n$. Define $t_n$ to be the rational number whose base-k
expansion is $t_n = 0.U_n V_n V_n V_n \cdots$. Hence $t_n = \frac{p_n}{k^{r_n}(k^{s_n} - 1)}$ for some integer
$p_n$. Note that

$$|\xi - t_n| < \frac{1}{k^{r_n + 2s_n + s_n'}}.$$

Now,

$$\frac{s_n}{r_n + s_n} \ge \frac{1}{1 + \alpha}$$

and

$$\frac{r_n + 2s_n + s_n'}{r_n + s_n} \ge 1 + \frac{1 + \beta}{1 + \alpha}.$$

Hence there exist two positive real numbers $\mu, \rho$ such that

$$\frac{s_n}{r_n + s_n} < \mu < \rho - 1 < \frac{s_n + s_n'}{r_n + s_n}$$

for infinitely many n. With this choice of $\mu$ and $\rho$, let us take $p_n' = p_n$, $\lambda = 1$,
$c_2 = 1$, $q_n' = k^{s_n} - 1$. Let us choose the primes $\omega_{r'+1}, \ldots, \omega_{r'+r''}$ to be the
prime divisors of k. Finally, defining $e_{r'+1}, \ldots, e_{r'+r''}$ by $k^{r_n} = \prod_{r'+1 \le j \le r'+r''} \omega_j^{e_j}$,
we can apply Ridout's theorem if $\xi$ were algebraic irrational, and deduce that
$\rho \le \lambda + \mu$, which gives a contradiction. Hence $\xi$ is transcendental. (Note that
the $t_n$'s are not necessarily in their irreducible forms, but there is an infinite
number of them, since the sequence $(t_n)_n$ converges to $\xi$, which is irrational
from the hypothesis.) □


We deduce a theorem on transcendence of certain “automatic” real numbers. 
By abuse of notation with respect to Section 1.2.2, we define here an overlap 
as a word of the form wwa where a is the first letter of w (in other words an 
overlap is the beginning of a cube just longer than a square). 


THEOREM 10.6.3. If the expansion of the real number $\xi \in (0,1)$ in some
integer base $b \ge 2$ is a non-ultimately periodic fixed point of a d-morphism h for
some $d \ge 2$, and if furthermore this expansion contains an overlap, then the
number $\xi$ is transcendental.


Proof. We write the base-b expansion of $\xi$ as $\xi = 0.UVVa\cdots$, where U and
V are finite words, and a is the first letter of V. Since the expansion of $\xi$
is a fixed point of the d-morphism h, this expansion also begins with
$h^n(U)\, h^n(V)\, h^n(V)\, h^n(a)$ for every $n \ge 1$. We can apply the previous corollary
with $U_n = h^n(U)$, $V_n = h^n(V)$, and $V_n' = h^n(a)$: namely $|h^n(U)| = d^n |U|$,
$|h^n(V)| = d^n |V|$, and $|h^n(a)| = d^n$. □


10.7. The Tribonacci word 


The aim of this section is to use the Tribonacci word as a guideline to introduce
various applications of combinatorics on words and symbolic dynamics to
arithmetic.


10.7.1. Definitions and notation 


Let us recall that the Tribonacci word is defined as the fixed point (in the sense
of Remark 10.1.4) of the Tribonacci morphism $\sigma : \{1,2,3\}^* \to \{1,2,3\}^*$ defined
on the letters of the alphabet {1,2,3} as follows: $\sigma : 1 \mapsto 12,\ 2 \mapsto 13,\ 3 \mapsto 1$.
Let us observe that the Tribonacci morphism admits a unique one-sided fixed
point u in $\{1,2,3\}^{\omega}$.

The incidence matrix of the Tribonacci morphism $\sigma$ is

$$M_\sigma = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$

This matrix is easily seen to be primitive. Hence the Perron-Frobenius theorem
applies (for more details, see Section 1.7.2).

Indeed the characteristic polynomial of $M_\sigma$ is $X^3 - X^2 - X - 1$; this polynomial
admits one positive root $\beta > 1$ (the dominant eigenvalue) and two complex
conjugates $\alpha$ and $\bar{\alpha}$, with $|\alpha| < 1$; in particular, one has $1/\beta = \alpha\bar{\alpha}$. Hence $\beta$ is
a Pisot number, that is, an algebraic integer with all Galois conjugates having
modulus less than 1.

In particular, the incidence matrix $M_\sigma$ admits as eigenspaces in $\mathbb{R}^3$ one
expanding eigenline (generated by the eigenvector with positive coordinates
$v_\beta = (1/\beta, 1/\beta^2, 1/\beta^3)$ associated with the eigenvalue $\beta$) and a contracting eigenplane
P; we denote by $v_\alpha$ and $v_{\bar{\alpha}}$ the eigenvectors in $\mathbb{C}^3$ associated with $\alpha$ and $\bar{\alpha}$,
normalized in such a way that the sum of their coordinates equals 1.

One associates with the Tribonacci word $u = (u_n)_{n\ge 0}$ a broken line starting
from 0 in $\mathbb{Z}^3$ and approximating the expanding line $\mathbb{R}v_\beta$ as follows. Let us first
introduce the abelianization map f of the free monoid $\{1,2,3\}^*$ defined by

$$f : \{1,2,3\}^* \to \mathbb{Z}^3, \qquad f(w) = |w|_1 e_1 + |w|_2 e_2 + |w|_3 e_3,$$

where $|w|_i$ denotes the number of occurrences of the letter i in the word w, and
$(e_1, e_2, e_3)$ denotes the canonical basis of $\mathbb{R}^3$. Note that for every finite word w,
we have

$$f(\sigma(w)) = M_\sigma f(w).$$

The Tribonacci broken line is defined as the broken line which joins with
segments of length 1 the points $f(u_0 u_1 \cdots u_{N-1})$, $N \in \mathbb{N}$ (see Figure 10.1). In
other words we describe this broken line by starting from the origin, and then
by reading successively the letters of the Tribonacci word u, going one step in
direction $e_i$ if one reads the letter i.

We will see in Section 10.7.3 that the vectors $f(u_0 u_1 \ldots u_N)$, $N \in \mathbb{N}$, stay
within bounded distance of the expanding line, which is exactly the direction
given by the vector of probabilities of occurrence $(\pi(1), \pi(2), \pi(3))$ of the letters
1, 2, 3 in u. It is then natural to try to represent these points by projecting them
along the expanding direction onto a transverse plane, that we choose here to be
the plane x + y + z = 0. The closure of the set of projected vertices of the broken
line is called the Rauzy fractal and is represented in Figure 10.2. We detail this
construction in Section 10.8.1. We then study the arithmetic and topological
properties of the Rauzy fractal in Sections 10.8.2 and 10.8.4, respectively, which
leads to the proof of the main theorem of this section: Theorem 10.8.16 states
that the Tribonacci word codes the orbit of the point 0 under the action of
the toral translation in $\mathbb{T}^2$: $x \mapsto x + (1/\beta, 1/\beta^2)$. We discuss in Section 10.9 some
applications of this theorem to simultaneous approximations: it is proved that
the points of the broken line corresponding to $\sigma^n(1)$, $n \in \mathbb{N}$, produce best
approximations for the vector $(1/\beta, 1/\beta^2)$ for a given norm associated with the
matrix $M_\sigma$.


Figure 10.1. The Rauzy broken line. 


10.7.2. Numeration in Tribonacci base 


We now introduce two numeration systems which will be used to expand here 
either natural integers or finite factors of the Tribonacci word. 

The sequence of lengths $T = (T_n)_{n\ge 0}$ of the words $\sigma^n(1)$ is called the
sequence of Tribonacci numbers. One has $T_0 = 1$, $T_1 = 2$, $T_2 = 4$ and for all
$n \in \mathbb{N}$, $T_{n+3} = T_{n+2} + T_{n+1} + T_n$. Indeed, one has for $n \in \mathbb{N}$

$$\sigma^{n+3}(1) = \sigma^{n+2}(12) = \sigma^{n+2}(1)\,\sigma^{n+1}(13) = \sigma^{n+2}(1)\,\sigma^{n+1}(1)\,\sigma^{n}(1).$$


Let us observe that this sequence is increasing, and thus tends to infinity. 
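
The words $\sigma^n(1)$ and their lengths are easy to generate. The following sketch is illustrative (the names `SIGMA`, `apply_sigma` and `sigma_power` are assumptions of this example); it checks the recurrence on lengths and the factorization of $\sigma^{n+3}(1)$ used to derive it.

```python
SIGMA = {"1": "12", "2": "13", "3": "1"}  # the Tribonacci morphism


def apply_sigma(word: str) -> str:
    return "".join(SIGMA[c] for c in word)


def sigma_power(n: int, word: str = "1") -> str:
    """The word sigma^n(word)."""
    for _ in range(n):
        word = apply_sigma(word)
    return word


words = [sigma_power(n) for n in range(15)]
T = [len(w) for w in words]  # Tribonacci numbers T_0, ..., T_14

recurrence_ok = all(T[n + 3] == T[n + 2] + T[n + 1] + T[n] for n in range(12))
factorization_ok = all(
    words[n + 3] == words[n + 2] + words[n + 1] + words[n] for n in range(12)
)
print(T[:7], recurrence_ok, factorization_ok)
```

Each $\sigma^n(1)$ is a prefix of $\sigma^{n+1}(1)$, so the list `words` gives longer and longer prefixes of the Tribonacci word.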
Figure 10.2. The Rauzy fractal.


A greedy or normal representation in the system T of a nonnegative integer N
is a finite sequence of digits $(\varepsilon_i)_{0 \le i \le k}$ where for all i, $\varepsilon_i \in \{0,1\}$ and
$\varepsilon_{i+2}\varepsilon_{i+1}\varepsilon_i = 0$, $\varepsilon_k \ne 0$, such that

$$N = \sum_{0 \le i \le k} \varepsilon_i T_i.$$


LEMMA 10.7.1. Every nonnegative integer admits a unique normal T-representation.


Proof. Let us first prove the existence of the decomposition by induction. We
consider the following induction property: for any integer $0 \le N < T_k$ (with
$k \ge 1$), there exists a decomposition $N = \sum_{i=0}^{k-1} \varepsilon_i T_i$, where for all i, $\varepsilon_i \in \{0,1\}$
and $\varepsilon_{i+2}\varepsilon_{i+1}\varepsilon_i = 0$. This property holds for k = 1, 2.

Suppose that the induction hypothesis holds for the integer $k \ge 2$. Let
$T_k \le N < T_{k+1} = T_k + T_{k-1} + T_{k-2}$; we have $N - T_k < T_k$ and by hypothesis,
$N - T_k = \sum_{i=0}^{k-1} \varepsilon_i T_i$, hence $N = T_k + \sum_{i=0}^{k-1} \varepsilon_i T_i$. Assume that $\varepsilon_{k-1} = 1$. Since
$N < T_{k+1} = T_k + T_{k-1} + T_{k-2}$, then $\varepsilon_{k-2} = 0$, and the property holds for k+1.

The unicity of a normal T-expansion is a direct consequence of the following
observation: one has $\sum_{i=0}^{k} \varepsilon_i T_i < T_{k+1}$, where for all i, $\varepsilon_i \in \{0,1\}$ and
$\varepsilon_{i+2}\varepsilon_{i+1}\varepsilon_i = 0$.

This can easily be proved by induction. Indeed if $\varepsilon_k = \varepsilon_{k-1} = 1$, then
$\varepsilon_{k-2} = 0$ and $\sum_{i=0}^{k} \varepsilon_i T_i = T_k + T_{k-1} + \sum_{i=0}^{k-3} \varepsilon_i T_i$. By induction hypothesis,
$\sum_{i=0}^{k-3} \varepsilon_i T_i < T_{k-2}$, hence we get that $\sum_{i=0}^{k} \varepsilon_i T_i < T_k + T_{k-1} + T_{k-2} = T_{k+1}$. □
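
The existence part of Lemma 10.7.1 corresponds to a greedy algorithm, and uniqueness can be checked exhaustively on small cases. The sketch below is an illustration; the function names are assumptions, and digits are stored in the order $(\varepsilon_0, \ldots, \varepsilon_k)$, with the empty list standing for N = 0.

```python
from itertools import product


def tribonacci_numbers(limit: int) -> list:
    """T_0 = 1, T_1 = 2, T_2 = 4, ..., extended until the last term exceeds limit."""
    t = [1, 2, 4]
    while t[-1] <= limit:
        t.append(t[-1] + t[-2] + t[-3])
    return t


def normal_representation(n: int) -> list:
    """Greedy digits (eps_0, ..., eps_k) with n = sum of eps_i * T_i."""
    t = tribonacci_numbers(n)
    digits = [0] * len(t)
    for i in range(len(t) - 1, -1, -1):
        if t[i] <= n:          # take the largest Tribonacci number that fits
            digits[i] = 1
            n -= t[i]
    while digits and digits[-1] == 0:
        digits.pop()
    return digits


def is_normal(digits) -> bool:
    """Leading digit 1 (or empty) and no three consecutive digits equal to 1."""
    return (not digits or digits[-1] == 1) and all(
        digits[i] * digits[i + 1] * digits[i + 2] == 0
        for i in range(len(digits) - 2)
    )


def value(digits, t) -> int:
    return sum(e * ti for e, ti in zip(digits, t))


T = tribonacci_numbers(1000)
# All normal digit strings with at most 9 digits, and the integers they represent.
values = sorted(
    value(d, T)
    for length in range(10)
    for d in product((0, 1), repeat=length)
    if is_normal(d)
)
print(normal_representation(100), values == list(range(T[9])))
```

The final comparison expresses both halves of the lemma on this range: the normal digit strings with at most 9 digits represent each integer in $[0, T_9 - 1]$ exactly once.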


LEMMA 10.7.2. Every prefix w of the Tribonacci word u can be uniquely
expanded as

$$w = \sigma^n(p_n)\, \sigma^{n-1}(p_{n-1}) \cdots p_0,$$

where the finite words $p_i$ are either equal to the empty word $\varepsilon$ or to the letter 1,
$p_n \ne \varepsilon$, and if $p_i = p_{i-1} = 1$, then $p_{i-2} = \varepsilon$; furthermore, |w| admits as normal
T-representation $|w| = \sum_{i=0}^{n} \varepsilon_i T_i$, with $\varepsilon_i = 1$ if $p_i = 1$, and $\varepsilon_i = 0$, otherwise.
Conversely every finite word that can be decomposed under this form is a prefix
of the Tribonacci word.

Such a representation is called normal Tribonacci representation.


Proof. The proof works exactly in the same way as the proof of Lemma 10.7.1. 
Let us prove by induction on n > 1 that every prefix w of length |w| < |o”"(1)| 
can be decomposed as 


w =o"! (pp_-1)0”"? (Pn—2) ++ = Po; 


where the finite words p; are either equal to the empty word ¢ or to the letter 
1, pn-1 # €, and if pj = pj_1 = 1, then pj_2 = 0. The induction property holds 
for n = 1,2. 

Let w be a prefix of length at least 4 of the Tribonacci word. Then there 
exists a positive integer n > 2 such that |o"(1)| < |w| < |o”*1(1)|. One 
hago”? 71) =e" (1)e**1)e" 71), Put oy = 1; put py, = 1p at fe] > 
lo” (1)| + Jo? “1 (1), and pn_1 =e, otherwise. 

Let v be such that w = 0" (ppn)o"”~!(pn_1)v (v may be equal to the empty 
word); v is a prefix either of o’~1(1) or of o®~7(1). If pp—1 = 1, then |v| < 
|o”~2(1)|. We conclude by applying the induction hypothesis on v. 

The unicity of such an expansion, as well as the corresponding normal T- 
representation for |w|, is a direct consequence of the fact that |o”(1)| = T,, and 
of the unicity of normal T-representations (Lemma 10.7.1). 

Let us prove by induction on $n$ that every finite word of the form

\[ \sigma^{n}(p_n)\,\sigma^{n-1}(p_{n-1}) \cdots p_0, \]

where the finite words $p_i$ are either equal to the empty word $\varepsilon$
or to the letter 1, $p_n \neq \varepsilon$, and if $p_i = p_{i-1} = 1$, then
$p_{i-2} = \varepsilon$, is a prefix of the word $\sigma^{n+1}(1)$.
This property holds for $n = 0, 1$. Assume that the hypothesis holds for every
integer $k \le n-1$. Let $w = \sigma^{n}(p_n)\,\sigma^{n-1}(p_{n-1}) \cdots p_0$,
with the above mentioned conditions on the "digits" $p_i$ (and in particular
$p_n = 1$).

One has

\[ \sigma^{n+1}(1) = \sigma^{n}(1)\,\sigma^{n}(2) = \sigma^{n}(1)\,\sigma^{n-1}(1)\,\sigma^{n-1}(3) = \sigma^{n}(1)\,\sigma^{n-1}(1)\,\sigma^{n-2}(1). \]

Assume $p_{n-1} = 1$, then one has $p_{n-2} = \varepsilon$. By induction
hypothesis the word $\sigma^{n-3}(p_{n-3}) \cdots p_0$ is a prefix of
$\sigma^{n-2}(1)$, which implies that $w$ is a prefix of
$\sigma^{n}(1)\,\sigma^{n-1}(1)\,\sigma^{n-2}(1)$ and thus of $\sigma^{n+1}(1)$.

Assume now $p_{n-1} = \varepsilon$. Then $\sigma^{n-2}(p_{n-2}) \cdots p_0$ is
a prefix of $\sigma^{n-1}(1)$, and $w$ is a prefix of $\sigma^{n+1}(1)$, which
ends the proof. □


REMARK 10.7.3. Such a numeration system on finite factors of the Tribonacci 
word can similarly be introduced for fixed points of morphisms in the sense of 
Remark 10.1.4 (see Problem 10.7.3). 
10.7.3. Density properties: statistics on letters


Since the Tribonacci morphism is primitive, we know from Sections 1.7.2 and
1.8.6 that, by applying the Perron–Frobenius theorem, the letters admit
densities in the Tribonacci word and the vector of probabilities of occurrence
of letters is equal to the normalized positive (right) eigenvector
$v_\beta = (1/\beta, 1/\beta^2, 1/\beta^3)$ associated with the dominant
eigenvalue $\beta$ (let us recall that the incidence matrix is the transpose of
the matrix introduced in Section 1.8.6). We give below a direct proof of this
result and prove even a stronger result of convergence towards the
probabilities of letters.


PROPOSITION 10.7.4. Each of the letters 1, 2, 3 admits a density in the
Tribonacci word. The probabilities of letters are positive. More precisely, the
vector of probabilities $(\pi(1), \pi(2), \pi(3))$ is equal to the normalized
positive eigenvector $v_\beta = (1/\beta, 1/\beta^2, 1/\beta^3)$ associated
with the dominant eigenvalue $\beta$ of the incidence matrix of the Tribonacci
morphism. Furthermore, there exists $C > 0$ such that, for $i = 1, 2, 3$,

\[ \forall N, \quad \bigl|\, |u_0u_1\cdots u_{N-1}|_i - \pi(i)N \,\bigr| \le C. \]
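The bounded discrepancy can be observed numerically. The following sketch is only illustrative (the function names are hypothetical); it assumes the morphism $\sigma\colon 1 \mapsto 12,\ 2 \mapsto 13,\ 3 \mapsto 1$ and computes $\beta$ as the real root of $x^3 = x^2 + x + 1$ by bisection. It returns the largest deviation between the letter counts of a prefix and $N/\beta^i$.

```python
def tribonacci_word(length):
    """Prefix of the fixed point of sigma: 1 -> 12, 2 -> 13, 3 -> 1."""
    sigma = {1: (1, 2), 2: (1, 3), 3: (1,)}
    word = [1]
    while len(word) < length:
        word = [b for a in word for b in sigma[a]]
    return word[:length]


def dominant_root(iterations=200):
    """Real root beta of x^3 = x^2 + x + 1, located in [1, 2] by bisection."""
    lo, hi = 1.0, 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid**3 - mid**2 - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def max_discrepancy(n_max):
    """Largest | |u_0...u_{N-1}|_i - N / beta^i | over N <= n_max and i = 1, 2, 3."""
    beta = dominant_root()
    counts = {1: 0, 2: 0, 3: 0}
    worst = 0.0
    for n, letter in enumerate(tribonacci_word(n_max), start=1):
        counts[letter] += 1
        for i in (1, 2, 3):
            worst = max(worst, abs(counts[i] - n / beta**i))
    return worst


print(max_discrepancy(5000))
```

The printed value does not grow with the length of the prefix, which is what the constant $C$ of the proposition expresses.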


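Proof. Let $u_0u_1\cdots u_{N-1}$ be a prefix of the Tribonacci word; according
to Lemma 10.7.2, let us decompose it as

\[ u_0\cdots u_{N-1} = \sigma^{n}(p_n)\,\sigma^{n-1}(p_{n-1}) \cdots p_0, \]

where the finite words $p_i$ are either equal to the empty word $\varepsilon$
or to the letter 1, $p_n \neq \varepsilon$, and if $p_i = p_{i-1} = 1$, then
$p_{i-2} = \varepsilon$. Then for $i = 1, 2, 3$

\[ |u_0\cdots u_{N-1}|_i = \langle f(u_0\cdots u_{N-1}), e_i \rangle, \]

where $\langle\ ,\ \rangle$ denotes the Hermitian scalar product in
$\mathbb{C}^3$.
Let us write
$e_1 = a_\beta v_\beta + a_\alpha v_\alpha + a_{\bar\alpha} v_{\bar\alpha}$,
where $a_\beta, a_\alpha, a_{\bar\alpha} \in \mathbb{C}$. We have

\[ f(\sigma^{k}(1)) = M^{k} e_1 = a_\beta \beta^{k} v_\beta + a_\alpha \alpha^{k} v_\alpha + a_{\bar\alpha} \bar\alpha^{k} v_{\bar\alpha}. \]

Furthermore,

\[ f(u_0\cdots u_{N-1}) = \sum_{k=0}^{n} f(\sigma^{k}(p_k)), \]

which implies for $i = 1, 2, 3$

\[ |u_0\cdots u_{N-1}|_i = a_\beta \Bigl(\sum_{k=0}^{n} |p_k| \beta^{k}\Bigr) \langle v_\beta, e_i \rangle
 + a_\alpha \Bigl(\sum_{k=0}^{n} |p_k| \alpha^{k}\Bigr) \langle v_\alpha, e_i \rangle
 + a_{\bar\alpha} \Bigl(\sum_{k=0}^{n} |p_k| \bar\alpha^{k}\Bigr) \langle v_{\bar\alpha}, e_i \rangle. \]

Let us recall that $|\alpha| < 1$. We have proved that the vectors
$f(\sigma^{k}(1))$ converge exponentially fast to the expanding line, whereas
the vectors $f(u_0\cdots u_{N-1})$ stay within bounded distance of this line
(Figure 10.1).

One has

\[ N = \sum_{i=1,2,3} |u_0\cdots u_{N-1}|_i
 = a_\beta \sum_{k=0}^{n} |p_k| \beta^{k} + a_\alpha \sum_{k=0}^{n} |p_k| \alpha^{k} + a_{\bar\alpha} \sum_{k=0}^{n} |p_k| \bar\alpha^{k}, \]

since
$\langle v_\beta, e_1+e_2+e_3 \rangle = \langle v_\alpha, e_1+e_2+e_3 \rangle = \langle v_{\bar\alpha}, e_1+e_2+e_3 \rangle = 1$,
according to our conventions of normalization.
Hence there exists $C > 0$ such that

\[ \forall i = 1, 2, 3, \quad \bigl| \langle f(u_0\cdots u_{N-1}), e_i \rangle - N \langle v_\beta, e_i \rangle \bigr| \le C, \]

which implies in particular that
$\langle v_\beta, e_i \rangle = \pi(i) = 1/\beta^{i}$, $i = 1, 2, 3$. □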


REMARK 10.7.5. Proposition 10.7.4 holds more generally for Pisot morphisms 
(Problem 10.7.4) and is strongly connected to the balance properties of their 
fixed points (Problem 10.7.5). Let us observe that the statement in Proposition 
10.7.4 is stronger than Assertion 5 of the Perron—Frobenius theorem. 


10.8. The Rauzy fractal 


10.8.1. A discrete approximation of the line 


The Tribonacci broken line stays within a bounded distance of the expanding
line (Proposition 10.7.4 and Figure 10.1). Let us project its vertices
$f(u_0\cdots u_{N-1})$ along the expanding direction $v_\beta$, in order to
obtain in particular some information on the quality of approximation of the
expanding line by the points $f(\sigma^{k}(1))$, $k \in \mathbb{N}$. We thus
choose here to project onto the plane $x+y+z = 0$; this allows us to express
the coordinates of the projected points in the basis $(e_3-e_1, e_3-e_2)$ of
the plane $x+y+z = 0$ in terms of the convergence towards the probabilities of
occurrence of the letters, as explained below (Equation (10.8.1)).

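Let $\pi_0$ denote the projection in $\mathbb{R}^3$ onto the plane $P_0$ of
equation $x+y+z = 0$ along the expanding line generated by the vector
$v_\beta$. One has

\[ \forall P = (x,y,z) \in \mathbb{R}^3, \quad \pi_0(P) = (x,y,z) - \langle (x,y,z), (1,1,1) \rangle\, v_\beta, \]

that is,

\[ \pi_0(P) = \Bigl(\tfrac{1}{\beta}(x+y+z) - x\Bigr)(e_3-e_1) + \Bigl(\tfrac{1}{\beta^2}(x+y+z) - y\Bigr)(e_3-e_2). \]

In particular, if $P = f(u_0\cdots u_{N-1})$ for some $N \in \mathbb{N}$, then

\[ \pi_0(P) = \Bigl(\tfrac{N}{\beta} - |u_0\cdots u_{N-1}|_1\Bigr)(e_3-e_1) + \Bigl(\tfrac{N}{\beta^2} - |u_0\cdots u_{N-1}|_2\Bigr)(e_3-e_2). \qquad (10.8.1) \]
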
We define the set $\mathcal{R}$ as the closure of the projections of the
vertices of the Tribonacci broken line:

\[ \mathcal{R} := \overline{\{\pi_0(f(u_0\cdots u_{N-1}));\ N \in \mathbb{N}\}}, \]

where $u_0\cdots u_{N-1}$ stands for the empty word when $N = 0$. The set
$\mathcal{R}$ is called the Rauzy fractal associated with the Tribonacci
morphism $\sigma$ (see Figure 10.2).
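A finite approximation of this set can be produced by projecting the first vertices of the broken line. The sketch below is illustrative only: the function names are hypothetical, the morphism is assumed to be $1 \mapsto 12,\ 2 \mapsto 13,\ 3 \mapsto 1$, and $\beta$ is computed numerically. Each returned point is $P - (x+y+z)\,v_\beta$ for $P = f(u_0\cdots u_{N-1})$.

```python
def tribonacci_word(length):
    """Prefix of the fixed point of sigma: 1 -> 12, 2 -> 13, 3 -> 1."""
    sigma = {1: (1, 2), 2: (1, 3), 3: (1,)}
    word = [1]
    while len(word) < length:
        word = [b for a in word for b in sigma[a]]
    return word[:length]


def dominant_root(iterations=200):
    """Real root beta of x^3 = x^2 + x + 1, located in [1, 2] by bisection."""
    lo, hi = 1.0, 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid**3 - mid**2 - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def rauzy_points(n_points):
    """pi_0(f(u_0...u_{N-1})) for N = 0, ..., n_points - 1 (n_points >= 1)."""
    beta = dominant_root()
    v_beta = (1 / beta, 1 / beta**2, 1 / beta**3)  # coordinates sum to 1
    counts = [0, 0, 0]
    points = [(0.0, 0.0, 0.0)]  # N = 0: empty prefix
    for letter in tribonacci_word(n_points - 1):
        counts[letter - 1] += 1
        n = sum(counts)
        points.append(tuple(c - n * v for c, v in zip(counts, v_beta)))
    return points


points = rauzy_points(2000)
print(max(abs(c) for p in points for c in p))  # stays bounded as n_points grows
```

Plotting the points in any basis of the plane $x+y+z = 0$ gives a picture of the fractal; the boundedness of the coordinates reflects Proposition 10.7.4.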

We now introduce a lattice in the plane $P_0$ which will play a key role in the
following. Let $L_0 := \mathbb{Z}^3 \cap P_0$; $L_0$ is equal to the lattice
$\mathbb{Z}(e_3-e_1) + \mathbb{Z}(e_3-e_2)$.


PROPOSITION 10.8.1. The set $\mathcal{R}$ is compact. The translates of the
Rauzy fractal by the vectors of the lattice $L_0$ cover the contracting plane
$P_0$, that is,

\[ \bigcup_{\gamma \in L_0} (\mathcal{R} + \gamma) = P_0. \qquad (10.8.2) \]

The interior of $\mathcal{R}$ is not empty.


Proof. We first deduce from (10.8.1) and Proposition 10.7.4 that the Rauzy
fractal is bounded, and hence compact.

We then need the following lemma to prove that one has a covering of the
plane $P_0$ by the translates of the Rauzy fractal.


LEMMA 10.8.2. The translates along the lattice $L_0$ of the vertices of the
broken line $f(u_0u_1\cdots u_{N-1})$, $N \in \mathbb{N}$, cover the following
upper half space:

\[ \{f(u_0u_1\cdots u_{N-1}) + \gamma;\ N \in \mathbb{N},\ \gamma \in L_0\} = \{(x,y,z) \in \mathbb{Z}^3;\ x+y+z \ge 0\}. \]

Let $(x,y,z) \in \mathbb{Z}^3$ with $x+y+z \ge 0$; let $N = x+y+z$; one has
$N = |u_0u_1\cdots u_{N-1}|_1 + |u_0u_1\cdots u_{N-1}|_2 + |u_0u_1\cdots u_{N-1}|_3$.
Let

\[ \gamma = \bigl(x - |u_0u_1\cdots u_{N-1}|_1,\ y - |u_0u_1\cdots u_{N-1}|_2,\ z - |u_0u_1\cdots u_{N-1}|_3\bigr); \]

then $\gamma \in \mathbb{Z}(e_1-e_3) + \mathbb{Z}(e_2-e_3) = L_0$. □
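The argument is constructive, and can be sketched as follows. The function names are hypothetical and the morphism is assumed to be $1 \mapsto 12,\ 2 \mapsto 13,\ 3 \mapsto 1$. Given an integer point with nonnegative coordinate sum, the code returns $N$, the vertex $f(u_0\cdots u_{N-1})$ and the lattice vector $\gamma$.

```python
def tribonacci_word(length):
    """Prefix of the fixed point of sigma: 1 -> 12, 2 -> 13, 3 -> 1."""
    sigma = {1: (1, 2), 2: (1, 3), 3: (1,)}
    word = [1]
    while len(word) < length:
        word = [b for a in word for b in sigma[a]]
    return word[:length]


def lattice_translation(point):
    """Write an integer point (x, y, z) with x + y + z >= 0 as f(prefix) + gamma."""
    x, y, z = point
    n = x + y + z
    if n < 0:
        raise ValueError("the point must satisfy x + y + z >= 0")
    prefix = tribonacci_word(n)
    vertex = tuple(prefix.count(letter) for letter in (1, 2, 3))
    gamma = (x - vertex[0], y - vertex[1], z - vertex[2])
    return n, vertex, gamma


n, vertex, gamma = lattice_translation((5, 0, 2))
print(n, vertex, gamma, sum(gamma))
```

The coordinates of $\gamma$ always sum to zero, so $\gamma = \gamma_1 (e_1 - e_3) + \gamma_2 (e_2 - e_3)$ belongs to $L_0$.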


Let us end the proof of Proposition 10.8.1. We need the following theorem 
known as Kronecker’s theorem that we recall here without a proof (a proof of 
this theorem can be found for instance in Cassels 1957). 


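THEOREM 10.8.3 (Kronecker's theorem). Let $r \ge 1$ and let
$\alpha_1, \ldots, \alpha_r$ be $r$ real numbers such that
$1, \alpha_1, \ldots, \alpha_r$ are rationally independent. For every
$\eta > 0$ and for every $(x_1, \ldots, x_r) \in \mathbb{R}^r$, there exist
$N \in \mathbb{N}$, $(p_1, \ldots, p_r) \in \mathbb{Z}^r$ such that

\[ \forall i = 1, \ldots, r, \quad |N\alpha_i - p_i - x_i| < \eta. \]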


Let us apply Kronecker's theorem to $1, 1/\beta, 1/\beta^2$ (which are
rationally independent). Let us fix $\eta > 0$ and let $P$ be given in $P_0$
with coordinates $(x,y)$ say, in the basis $(e_3-e_1, e_3-e_2)$. There exist
$p, q \in \mathbb{Z}$, $N \in \mathbb{N}$ such that $|N/\beta - p - x| < \eta$
and $|N/\beta^2 - q - y| < \eta$. Take $r = N - (p+q)$. Then the coordinates in
the basis $(e_3-e_1, e_3-e_2)$ of $\pi_0(p,q,r)$ and $P$ differ by at most
$\eta$. We thus have proved that
$\pi_0(\{(p,q,r) \in \mathbb{Z}^3;\ p+q+r \ge 0\})$ is dense in $P_0$.
Consequently, given any point $P$ of $P_0$, there exists a sequence of points
$(\pi_0(f(u_0u_1\cdots u_{N_k-1})) + \gamma_k)_k$ with $\gamma_k$ in the
lattice $L_0$ which converges to $P$ in $P_0$. Since $\mathcal{R}$ is bounded,
there are infinitely many $k$ for which the points $\gamma_k$ of the lattice
$L_0$ take the same value, say $\gamma$; we thus get
$P \in \mathcal{R} + \gamma$, which implies (10.8.2). Since $L_0$ is countable,
we deduce from Baire's theorem that the interior of $\mathcal{R}$ is not
empty. □
REMARK 10.8.4. In fact, we have more than a covering by translates of the
Rauzy fractal: we have a periodic tiling of the plane up to sets of zero
Lebesgue measure, that is, the union in (10.8.2) is disjoint up to sets of zero
measure, as illustrated in Figure 10.4. We prove it in Section 10.8.5.


10.8.2. Arithmetic expression 


In order to study more carefully the topological properties of the Rauzy
fractal, which is the aim of Section 10.8.4, we introduce some more notation to
express the coordinates of the projections of the vectors
$f(u_0\cdots u_{N-1})$ in the basis $(e_3-e_1, e_3-e_2)$ of the plane $P_0$.

Let $\delta\colon \mathbb{N} \to \mathbb{R}^2$, $N \mapsto \delta(N)$, where
$\delta(N)$ denotes the vector of coordinates of
$\pi_0(f(u_0u_1\cdots u_{N-1}))$ in the basis $(e_3-e_1, e_3-e_2)$ of the plane
$x+y+z = 0$. One has according to (10.8.1), for $N \in \mathbb{N}$,

\[ \delta(N) = N \cdot (1/\beta, 1/\beta^2) - \bigl(|u_0u_1\cdots u_{N-1}|_1,\ |u_0u_1\cdots u_{N-1}|_2\bigr). \qquad (10.8.3) \]

Let

\[ B = \begin{pmatrix} -1/\beta & -1/\beta \\ 1 - 1/\beta^2 & -1/\beta^2 \end{pmatrix}. \]

One easily checks that for every word $w \in \{1,2,3\}^*$, the vector of
coordinates of $\pi_0(f(\sigma(w)))$ in the basis $(e_3-e_1, e_3-e_2)$ is equal
to the matrix $B$ applied to the vector of coordinates of $\pi_0(f(w))$ in the
same basis. We thus get that if $N$ has for normal T-representation

\[ N = \sum_{i=0}^{n} \varepsilon_i T_i, \quad\text{that is,}\quad u_0\cdots u_{N-1} = \sigma^{n}(p_n)\,\sigma^{n-1}(p_{n-1}) \cdots p_0, \ \text{with } |p_i| = \varepsilon_i, \]

then

\[ \delta(N) = \sum_{i=0}^{n} \varepsilon_i B^{i} z, \quad\text{where we set } z = \delta(1) = (1/\beta - 1,\ 1/\beta^2). \]

The eigenvalues of the matrix $B$ are of modulus smaller than 1, hence the
series $\sum_{i \ge 0} \varepsilon_i B^{i} z$ are convergent in $\mathbb{R}^2$.
The following proposition is thus an immediate consequence of this:


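PROPOSITION 10.8.5. The Rauzy fractal is the set of points of the plane $P_0$
with coordinates in the basis $(e_3-e_1, e_3-e_2)$ in

\[ R := \Bigl\{ \sum_{i \ge 0} \varepsilon_i B^{i} z;\ (\varepsilon_i)_{i \ge 0} \in \{0,1\}^{\mathbb{N}},\ \forall i,\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0 \Bigr\}. \]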
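The arithmetic expression of $\delta(N)$ can be tested numerically. The sketch below rests on assumptions that should be read as such: the function names are hypothetical, the morphism is $1 \mapsto 12,\ 2 \mapsto 13,\ 3 \mapsto 1$, and the matrix $B$ is taken to be the matrix of $\pi_0 \circ M$ in the basis $(e_3-e_1, e_3-e_2)$, namely $\bigl(\begin{smallmatrix} -1/\beta & -1/\beta \\ 1-1/\beta^2 & -1/\beta^2 \end{smallmatrix}\bigr)$. It compares formula (10.8.3) with the digit expansion $\sum \varepsilon_i B^i z$.

```python
def tribonacci_word(length):
    """Prefix of the fixed point of sigma: 1 -> 12, 2 -> 13, 3 -> 1."""
    sigma = {1: (1, 2), 2: (1, 3), 3: (1,)}
    word = [1]
    while len(word) < length:
        word = [b for a in word for b in sigma[a]]
    return word[:length]


def dominant_root(iterations=200):
    """Real root beta of x^3 = x^2 + x + 1, located in [1, 2] by bisection."""
    lo, hi = 1.0, 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid**3 - mid**2 - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mat_vec(m, v):
    """Product of a 2x2 matrix (tuple of rows) with a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])


def delta_direct(n, beta, word):
    """delta(N) from formula (10.8.3): N (1/beta, 1/beta^2) minus letter counts."""
    prefix = word[:n]
    return (n / beta - prefix.count(1), n / beta**2 - prefix.count(2))


def delta_from_digits(n, beta, t_numbers):
    """delta(N) as sum(eps_i * B^i z) over the greedy digits eps_i of N."""
    b = ((-1 / beta, -1 / beta), (1 - 1 / beta**2, -1 / beta**2))  # assumed B
    digits = []
    for t in reversed(t_numbers):      # greedy, from the largest T_i down
        digits.append(1 if t <= n else 0)
        n -= t * digits[-1]
    digits.reverse()                   # digits[i] is now eps_i
    total = (0.0, 0.0)
    term = (1 / beta - 1, 1 / beta**2)  # term = B^i z, starting with z
    for eps in digits:
        if eps:
            total = (total[0] + term[0], total[1] + term[1])
        term = mat_vec(b, term)
    return total


beta = dominant_root()
word = tribonacci_word(300)
t_numbers = [1, 2, 4, 7, 13, 24, 44, 81, 149, 274, 504]
worst = max(abs(a - b)
            for n in range(300)
            for a, b in zip(delta_direct(n, beta, word),
                            delta_from_digits(n, beta, t_numbers)))
print(worst)  # expected to be at rounding-error level
```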


REMARK 10.8.6. We will mostly study the set $R$ to deduce topological
properties of the Rauzy fractal $\mathcal{R}$; indeed both sets are by
definition in one-to-one correspondence, this bijection being the restriction
of a topological isomorphism.
Let us observe that similarly, the Rauzy fractal and

\[ \Bigl\{ \sum_{i \ge 0} \varepsilon_i \alpha^{i} \in \mathbb{C};\ (\varepsilon_i)_{i \ge 0} \in \{0,1\}^{\mathbb{N}},\ \forall i,\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0 \Bigr\} \]

are also easily seen to be in one-to-one correspondence. Indeed the matrix $B$
admits as characteristic polynomial $(X - \alpha)(X - \bar\alpha)$, and it is
thus similar in $\mathbb{C}^2$ to the matrix
$\begin{pmatrix} \alpha & 0 \\ 0 & \bar\alpha \end{pmatrix}$.

10.8.3. An exchange of pieces 


Let us introduce the following division of the Rauzy fractal into three sets
according to the letter $u_N$ that follows the prefix $u_0\cdots u_{N-1}$ being
projected. For $i \in \{1, 2, 3\}$ let

\[ \mathcal{R}_i = \overline{\{\pi_0(f(u_0\cdots u_{N-1}));\ N \in \mathbb{N},\ u_N = i\}}. \]

We similarly define the subsets $R_i$ of $\mathbb{R}^2$, $i = 1, 2, 3$, as,
respectively, the sets of coordinates of elements of $\mathcal{R}_i$ (in the
basis $(e_3-e_1, e_3-e_2)$).


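LEMMA 10.8.7. One has

\[ R_1 = \Bigl\{ \sum_{i \ge 0} \varepsilon_i B^{i} z;\ \forall i,\ \varepsilon_i \in \{0,1\};\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0;\ \varepsilon_0 = 0 \Bigr\}, \]
\[ R_2 = \Bigl\{ \sum_{i \ge 0} \varepsilon_i B^{i} z;\ \forall i,\ \varepsilon_i \in \{0,1\};\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0;\ \varepsilon_0 \varepsilon_1 = 10 \Bigr\}, \]
\[ R_3 = \Bigl\{ \sum_{i \ge 0} \varepsilon_i B^{i} z;\ \forall i,\ \varepsilon_i \in \{0,1\};\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0;\ \varepsilon_0 \varepsilon_1 = 11 \Bigr\}, \]

and

\[ R_1 = BR, \quad R_2 = z + B^2 R, \quad R_3 = z + Bz + B^3 R, \]

that is,

\[ R_1 = B(R_1 \cup R_2 \cup R_3), \quad R_2 = z + BR_1, \quad R_3 = z + BR_2. \]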


Proof. It is sufficient to check that if $u_0\cdots u_{N-1}$ admits for normal
Tribonacci representation $\sigma^{n}(p_n) \cdots \sigma^{0}(p_0)$, then

    $p_0 = \varepsilon$ implies $u_N = 1$,
    $p_0 = 1$, $p_1 = \varepsilon$ implies $u_N = 2$,
    $p_0 = 1$, $p_1 = 1$ implies $u_N = 3$.
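These three implications can be tested on a long prefix. The sketch is illustrative: the function names are hypothetical, the morphism is assumed to be $1 \mapsto 12,\ 2 \mapsto 13,\ 3 \mapsto 1$, and the digits of $N$ are computed greedily, with the empty representation (all digits zero) for $N = 0$.

```python
def tribonacci_word(length):
    """Prefix of the fixed point of sigma: 1 -> 12, 2 -> 13, 3 -> 1."""
    sigma = {1: (1, 2), 2: (1, 3), 3: (1,)}
    word = [1]
    while len(word) < length:
        word = [b for a in word for b in sigma[a]]
    return word[:length]


def next_letter_from_digits(n, t_numbers):
    """Predict u_N from the two lowest digits eps_0, eps_1 of N."""
    digits = []
    for t in reversed(t_numbers):      # greedy, from the largest T_i down
        digits.append(1 if t <= n else 0)
        n -= t * digits[-1]
    digits.reverse()                   # digits[i] is now eps_i
    if digits[0] == 0:
        return 1
    return 2 if digits[1] == 0 else 3


t_numbers = [1, 2, 4, 7, 13, 24, 44, 81, 149, 274, 504, 927]
word = tribonacci_word(900)
print(all(next_letter_from_digits(n, t_numbers) == word[n] for n in range(899)))
```

For example $N = 3 = T_1 + T_0$ has lowest digits $\varepsilon_0\varepsilon_1 = 11$, and indeed $u_3 = 3$ in $u = 1213121\cdots$.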


• Assume that $p_0 = \varepsilon$. Then
$u_0\cdots u_{N-1} = \sigma^{n}(p_n) \cdots \sigma(p_1)$, and
$u_0\cdots u_N = \sigma^{n}(p_n) \cdots \sigma(p_1)\,u_N$. Hence $u_N$ needs to
be equal to 1, since the images of letters under $\sigma$ begin with 1, and
$u$ is fixed under $\sigma$.

• Assume that $p_0 = 1$ and $p_1 = \varepsilon$. One has
$u_0\cdots u_{N-1} = \sigma^{n}(p_n) \cdots \sigma^{2}(p_2)\,1$.
The word $\sigma^{n}(p_n) \cdots \sigma^{2}(p_2)\,\sigma(1)$ has length $N+1$.
If either $p_2$ or $p_3$ equals $\varepsilon$, then this expansion is a normal
Tribonacci representation, and thus a prefix of the Tribonacci word (according
to Lemma 10.7.2), which gives $u_N = 2$. Otherwise it can also be represented
as $\sigma^{n}(p_n) \cdots \sigma(1) = \sigma^{n}(p_n) \cdots \sigma^{4}(1)$.
One shows by induction that the last term of the normal Tribonacci
representation of this expansion is of the form $\sigma^{3k+1}(1)$, which
admits as last letter 2.

• Assume that $p_0 = 1$ and $p_1 = 1$, and thus $p_2 = \varepsilon$. Then
$u_0\cdots u_{N-1} = \sigma^{n}(p_n) \cdots \sigma^{3}(p_3)\,\sigma(1)\,1$.
The word $\sigma^{n}(p_n) \cdots \sigma^{2}(1)$ has length $N+1$. If either
$p_3$ or $p_4$ equals $\varepsilon$, then this expansion is a normal Tribonacci
representation, and thus a prefix of the Tribonacci word (according to Lemma
10.7.2), which gives $u_N = 3$. Otherwise it can also be represented as
$\sigma^{n}(p_n) \cdots \sigma^{2}(1) = \sigma^{n}(p_n) \cdots \sigma^{5}(1)$.
One shows by induction that the last term of the normal Tribonacci
representation of this expansion is of the form $\sigma^{3k+2}(1)$, which
admits as last letter 3. □


The sets $\mathcal{R}_i$, $i = 1, 2, 3$, are represented in Figure 10.2.
Figure 10.3 illustrates Lemma 10.8.8 below, that is, one can reorganize the
division of $\mathcal{R}$ into these three pieces up to translations.


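LEMMA 10.8.8. The following exchange of pieces $E$ is well-defined:

\[ E\colon \operatorname{Int}\mathcal{R}_1 \cup \operatorname{Int}\mathcal{R}_2 \cup \operatorname{Int}\mathcal{R}_3 \to \mathcal{R}, \quad x \mapsto x + \pi_0(e_i) \ \text{ when } x \in \operatorname{Int}\mathcal{R}_i. \]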


Figure 10.3. The exchange map FE. 


Figure 10.4. A piece of a periodic tiling by the Rauzy fractal. 


Proof. Let us first prove that the sets $\mathcal{R}_i$, for $i = 1, 2, 3$, are
two-by-two disjoint in measure, and hence that their interiors
$\operatorname{Int}\mathcal{R}_i$ are two-by-two disjoint.
Since $\mathcal{R}$ is compact, it is measurable for the Lebesgue measure and
its Lebesgue measure $\mu(\mathcal{R})$ is finite and nonzero since its
interior is not empty according to Proposition 10.8.1.

One has $\mu(\mathcal{R}) \le \sum_{i=1}^{3} \mu(\mathcal{R}_i)$. Since the
determinant of the matrix $B$ equals $1/\beta$, then according to Lemma 10.8.7

\[ \mu(\mathcal{R}_1) = (1/\beta)\,\mu(\mathcal{R}), \quad \mu(\mathcal{R}_2) = (1/\beta)^2 \mu(\mathcal{R}), \quad \mu(\mathcal{R}_3) = (1/\beta)^3 \mu(\mathcal{R}). \]

Hence one gets $\mu(\mathcal{R}) = \sum_{i=1}^{3} \mu(\mathcal{R}_i)$, since
$1/\beta + 1/\beta^2 + 1/\beta^3 = 1$. This implies in particular that
$\mu(\mathcal{R}_i \cap \mathcal{R}_j) = 0$ for $i \neq j$. The same holds for
their interiors $\operatorname{Int}\mathcal{R}_i$, that is,
$\mu(\operatorname{Int}\mathcal{R}_i \cap \operatorname{Int}\mathcal{R}_j) = 0$
for $i \neq j$, which implies that they are two-by-two disjoint.
One easily sees that for $i = 1, 2, 3$,
$\mathcal{R}_i + \pi_0(e_i) = \overline{\{\pi_0(f(u_0\cdots u_N));\ u_N = i\}}$,
which implies $\mathcal{R}_i + \pi_0(e_i) \subset \mathcal{R}$. We thus deduce
that the map $E$ is well-defined. □


REMARK 10.8.9. The sets $R_i$, $i = 1, 2, 3$, are not disjoint. Indeed a vector
with coordinates in the basis $(e_3-e_1, e_3-e_2)$ having several expansions as
$\sum_{i \ge 0} \varepsilon_i B^{i} z$ (with
$(\varepsilon_i)_{i \ge 0} \in \{0,1\}^{\mathbb{N}}$ and
$\forall i$, $\varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0$) can
belong simultaneously to several of these sets. This is the case in particular
of the vector with coordinates $\sum_{i \ge 1} B^{3i} z$. Since
$B^3 = B^2 + B + 1$, it admits the following three admissible expansions:

\[ \sum_{i=1}^{\infty} B^{3i} z = z + \sum_{i=0}^{\infty} B^{3i+4} z = z + Bz + \sum_{i=0}^{\infty} B^{3i+5} z. \]
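A numerical check of such coincidences is straightforward. The sketch below makes its assumptions explicit: the function names are hypothetical, $B$ is taken to be $\bigl(\begin{smallmatrix} -1/\beta & -1/\beta \\ 1-1/\beta^2 & -1/\beta^2 \end{smallmatrix}\bigr)$ and $z = (1/\beta - 1, 1/\beta^2)$, and the three expansions tested are $\sum_{i\ge 1} B^{3i}z$, $z + \sum_{i \ge 0} B^{3i+4}z$ and $z + Bz + \sum_{i\ge 0} B^{3i+5}z$, whose equality follows from $B^3 = B^2 + B + 1$. The series are truncated, which is harmless since the terms decay geometrically.

```python
def dominant_root(iterations=200):
    """Real root beta of x^3 = x^2 + x + 1, located in [1, 2] by bisection."""
    lo, hi = 1.0, 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid**3 - mid**2 - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mat_vec(m, v):
    """Product of a 2x2 matrix (tuple of rows) with a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])


def iterates_of_z(beta, count):
    """Return [z, Bz, B^2 z, ..., B^{count-1} z] for the assumed matrix B."""
    b = ((-1 / beta, -1 / beta), (1 - 1 / beta**2, -1 / beta**2))
    term = (1 / beta - 1, 1 / beta**2)
    out = []
    for _ in range(count):
        out.append(term)
        term = mat_vec(b, term)
    return out


def partial_sum(vectors, indices):
    """Sum of the vectors B^i z over the given digit positions i."""
    return (sum(vectors[i][0] for i in indices),
            sum(vectors[i][1] for i in indices))


vectors = iterates_of_z(dominant_root(), 600)
first = partial_sum(vectors, range(3, 600, 3))                 # digits at 3, 6, 9, ...
second = partial_sum(vectors, [0] + list(range(4, 600, 3)))    # digits at 0, 4, 7, ...
third = partial_sum(vectors, [0, 1] + list(range(5, 600, 3)))  # digits at 0, 1, 5, 8, ...
print(first, second, third)
```

The three digit sequences start with $0$, $10$ and $11$ respectively, so the common value lies in all three sets.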


10.8.4. Some topological properties 


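We need now to introduce a suitable norm on $\mathbb{R}^2$ associated with the
matrix $B$ that will be crucial for the statement of the first topological
properties of the Rauzy fractal, from which the arithmetic properties of
Section 10.9 will be deduced.


Let us recall that the matrix $B$ is similar in $\mathbb{C}^2$ to the matrix
$\begin{pmatrix} \alpha & 0 \\ 0 & \bar\alpha \end{pmatrix}$. Let

\[ M = \frac{1}{\sqrt{2}} \begin{pmatrix} \bar\alpha + 1/\beta & 1/\beta \\ \alpha + 1/\beta & 1/\beta \end{pmatrix}. \]

One easily checks that
$M B M^{-1} = \begin{pmatrix} \alpha & 0 \\ 0 & \bar\alpha \end{pmatrix}$.
The Rauzy norm $\|\ \|$ is defined for $x \in \mathbb{R}^2$ as the Euclidean
norm of $Mx$.


Hence, for every $x \in \mathbb{R}^2$

\[ \|Bx\| = |\alpha|\,\|x\| = \sqrt{1/\beta}\,\|x\|. \]

We denote by $|||\ |||$ the distance to the nearest point with integer
coordinates. One checks that

\[ \|z\| = \|\delta(1)\| = |\alpha|^{4} = 1/\beta^{2} \quad\text{and}\quad \|\delta(T_n)\| = |\alpha|^{n}\,\|z\| = |\alpha|^{n+4}. \]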
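The contraction property of this norm can be checked numerically. The sketch below is built on assumptions stated here and in the comments: the function names are hypothetical, $B$ is taken to be $\bigl(\begin{smallmatrix} -1/\beta & -1/\beta \\ 1-1/\beta^2 & -1/\beta^2 \end{smallmatrix}\bigr)$, and the norm is computed as the modulus of the linear form $(\bar\alpha + 1/\beta)x_1 + x_2/\beta$, which is a left eigenvector of that $B$ and is normalized so that $\|z\| = 1/\beta^2$.

```python
import math


def dominant_root(iterations=200):
    """Real root beta of x^3 = x^2 + x + 1, located in [1, 2] by bisection."""
    lo, hi = 1.0, 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid**3 - mid**2 - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mat_vec(m, v):
    """Product of a 2x2 matrix (tuple of rows) with a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])


def rauzy_norm(x, beta):
    """Assumed form of the Rauzy norm: |(conj(alpha) + 1/beta) x_1 + x_2 / beta|."""
    re = (1 - beta) / 2                    # alpha + conj(alpha) = 1 - beta
    im = math.sqrt(1 / beta - re * re)     # |alpha|^2 = 1 / beta
    alpha_bar = complex(re, -im)
    return abs((alpha_bar + 1 / beta) * x[0] + x[1] / beta)


beta = dominant_root()
b = ((-1 / beta, -1 / beta), (1 - 1 / beta**2, -1 / beta**2))  # assumed B
z = (1 / beta - 1, 1 / beta**2)

print(rauzy_norm(z, beta), 1 / beta**2)                      # ||z|| and 1/beta^2
x = (0.3, -0.7)
print(rauzy_norm(mat_vec(b, x), beta), rauzy_norm(x, beta) / math.sqrt(beta))
```

Both printed pairs agree up to rounding, which illustrates $\|Bx\| = |\alpha|\,\|x\|$ and $\|z\| = |\alpha|^4$ for this choice of normalization.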
We will mainly work in this section with the set of coordinates $R$, rather
than with $\mathcal{R}$ itself. The following lemma states that if one takes an
element of $\mathbb{R}^2$ of sufficiently small norm which is equal modulo
$\mathbb{Z}^2$ to the coordinates $\delta(N)$ of
$\pi_0(f(u_0\cdots u_{N-1}))$, then it has to be exactly equal to $\delta(N)$.
The proof is based on the fact that the set $R$ is contained in the square
$\{(x,y) \in \mathbb{R}^2;\ |x|, |y| < 1\}$ located at the origin. In
particular, 0 is the only element with integer coordinates contained in $R$.
This lemma is fundamental and is a first step toward the fact that if two
points of $R$ differ by a vector with integer coordinates, then these two
points do coincide.


LEMMA 10.8.10. There exists $C > 0$ such that

\[ \forall N \ge 1,\ \forall v \in \mathbb{Z}^2, \quad \|N \cdot (1/\beta, 1/\beta^2) - v\| < C \implies v = N \cdot (1/\beta, 1/\beta^2) - \delta(N). \]


REMARK 10.8.11. This lemma implies in particular that if the norm of
$\delta(N)$ is smaller than $C$, then
$(|u_0\cdots u_{N-1}|_1, |u_0\cdots u_{N-1}|_2)$ is the nearest point with
integer coordinates to $N \cdot (1/\beta, 1/\beta^2)$. For instance, for $n$
large enough,

\[ |||T_n \cdot (1/\beta, 1/\beta^2)||| = \|\delta(T_n)\|, \]

since $\|\delta(T_n)\| = |\alpha|^{n+4} < C$. In other words, the projections
of the points $f(\sigma^{n}(1))$ approximate very well the points with
coordinates $T_n \cdot (1/\beta, 1/\beta^2)$.


Proof. Let $N \ge 1$ with normal T-representation
$N = \sum_{i=0}^{k} \varepsilon_i T_i$. One can write
$\delta(N) = \sum_{i=0}^{k} \varepsilon_i B^{i} z$ as
$\delta(N) = \sum_{3i \le k} B^{3i} y_i$, where $y_i$ belongs to the following
set $F$:

\[ F := \{0,\ z,\ Bz,\ B^2 z,\ z + Bz,\ z + B^2 z,\ Bz + B^2 z\}. \]

Hence

\[ \|\delta(N)\| \le \sum_{i \ge 0} |\alpha|^{3i} \max_{y \in F} \|y\|. \]

One checks that $\max_{y \in F} \|y\| = \|z\| = 1/\beta^2$. We thus get

\[ \|\delta(N)\| \le \frac{1}{\beta^2 (1 - |\alpha|^3)}. \qquad (10.8.4) \]

One also checks that the set of points $x \in \mathbb{R}^2$ such that
$\|x\| \le 0.53$ is a domain delimited by an ellipse strictly included in the
square $\{(x,y) \in \mathbb{R}^2;\ |x|, |y| < 1\}$.
Hence $R$ is also included in this square, following (10.8.4).

Let $v \in \mathbb{Z}^2$. Take $C = 0.03$ for instance. Let $N \ge 1$ such that

\[ \|N \cdot (1/\beta, 1/\beta^2) - v\| < C. \]

Hence according to (10.8.3)

\[ \bigl\| (|u_0\cdots u_{N-1}|_1, |u_0\cdots u_{N-1}|_2) - v \bigr\| \le \|\delta(N)\| + \|N \cdot (1/\beta, 1/\beta^2) - v\| < 0.53, \]

which implies that $(|u_0\cdots u_{N-1}|_1, |u_0\cdots u_{N-1}|_2) - v$ belongs
to the square $\{(x,y) \in \mathbb{R}^2;\ |x|, |y| < 1\}$ and thus
$v = (|u_0\cdots u_{N-1}|_1, |u_0\cdots u_{N-1}|_2)$, since both vectors have
integer coordinates. □
PROPOSITION 10.8.12. The point 0 belongs to the interior of the Rauzy fractal
$\mathcal{R}$. Furthermore, for all $N \in \mathbb{N}$, $\delta(N)$ belongs to
the interior of $R_{u_N}$. Consequently, the Rauzy fractal is the closure of
its interior.


Proof. Let us prove that 0 is an interior point of the set $R$. Let $C$ be the
constant of Lemma 10.8.10. The sequence
$(N \cdot (1/\beta, 1/\beta^2))_{N \ge 0}$ is dense in $\mathbb{R}^2$ modulo
$\mathbb{Z}^2$ by Kronecker's theorem (Theorem 10.8.3), since
$1, 1/\beta, 1/\beta^2$ are linearly independent over $\mathbb{Q}$. In
particular, it is dense in the set $\{x \in \mathbb{R}^2;\ \|x\| < C\}$.
This implies, according to Lemma 10.8.10, that the points $\delta(N)$ are also
dense in this same set. Hence $\{x \in \mathbb{R}^2;\ \|x\| < C\}$ is included
in the closure $R$ of $\{\delta(N);\ N \in \mathbb{N}\}$. This proves that 0 is
an interior point.

One easily deduces that for every $N \in \mathbb{N}$, $\delta(N)$ belongs to
the interior of $R_{u_N}$. Indeed let us consider a given $N$ with normal
T-representation $\sum_{i=0}^{n} \varepsilon_i T_i$; by definition,
$\delta(N) \in R_{u_N}$; for any
$(\varepsilon_i)_{i \ge n+2} \in \{0,1\}^{\mathbb{N}}$, with the admissibility
condition that no three consecutive 1's occur in this sequence, then
$\delta(N) + \sum_{i \ge n+2} \varepsilon_i B^{i} z \in R_{u_N}$, which implies
that $\delta(N) + B^{n+2} R$ is still included in $R_{u_N}$, and thus
$\delta(N)$ belongs to the interior of $R_{u_N}$, since 0 belongs to the
interior of $R$. This easily implies that $\mathcal{R}$ is the closure of its
interior. □


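One can even get more information on the first coefficients of $N$ in its
normal T-representation if the distance between
$N \cdot (1/\beta, 1/\beta^2)$ and $\mathbb{Z}^2$ is small enough; this
provides some knowledge on the distribution of the sequence
$(N \cdot (1/\beta, 1/\beta^2))_{N \ge 0}$.


LEMMA 10.8.13. Let $N \ge 1$ with normal T-representation
$N = \sum_{i \ge 0} \varepsilon_i T_i$. Then

\[ \forall v \in \mathbb{Z}^2,\ \forall m \in \mathbb{N}, \quad \bigl( \|N \cdot (1/\beta, 1/\beta^2) - v\| < C \beta^{-m/2} \implies \forall i < m,\ \varepsilon_i = 0 \bigr). \]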


Proof. Let $N \ge 1$ with normal T-representation
$N = \sum_{i \ge 0} \varepsilon_i T_i$ and let $v \in \mathbb{Z}^2$ such that
there exists $m \ge 1$ with
$\|N \cdot (1/\beta, 1/\beta^2) - v\| < C \beta^{-m/2}$. Since
$\|N \cdot (1/\beta, 1/\beta^2) - v\| < C$, and according to Lemma 10.8.10,
then $\delta(N) = N \cdot (1/\beta, 1/\beta^2) - v$. Furthermore one has
$B^{-m} \delta(N) \in R$; indeed
$\|B^{-m} \delta(N)\| = \beta^{m/2} \|\delta(N)\| < C$, and we have seen in the
proof of Proposition 10.8.12 that $\{x;\ \|x\| < C\}$ is included in $R$.

It remains to prove that if $N$ satisfies $\delta(N) \in B^{m} R$, then its
normal T-representation verifies $N = \sum_{i \ge m} \varepsilon_i T_i$. For
that purpose, we introduce the following notation in order to refine the
partition of $R$ into the three pieces $R_i$, $i = 1, 2, 3$. Let us consider
the following three maps $\psi_i\colon \mathbb{R}^2 \to \mathbb{R}^2$,
$i = 1, 2, 3$, defined as follows (recall that $z = \delta(1)$):

\[ \psi_1\colon v \mapsto Bv, \quad \psi_2\colon v \mapsto z + B^2 v, \quad \psi_3\colon v \mapsto z + Bz + B^3 v. \]

For $a_1 \cdots a_r \in \{1,2,3\}^{r}$, let
$R_{a_1 \cdots a_r} = \psi_{a_1} \circ \cdots \circ \psi_{a_r}(R)$. Let us
observe that $R_{a_1 \cdots a_r} \subset R$.

One proves by induction that for all $N$, and for all $r$, there exists
$a_1 \cdots a_r$ such that $\delta(N)$ belongs to $R_{a_1 \cdots a_r}$. Indeed,
let $v = \sum_{i \ge 0} \varepsilon_i B^{i} z \in R$; if $v \in R_i$, then
there exists $w \in R$ such that $v = \psi_i(w)$. Furthermore, the same
argument as in the proof of Proposition 10.8.12 implies that $\delta(N)$
belongs to the interior of $R_{a_1 \cdots a_r}$.

Let us prove by induction on $r$ that the interiors of the sets
$R_{a_1 \cdots a_r}$ are two-by-two disjoint. The induction property holds for
$r = 1$ according to Lemma 10.8.8. Assume it is true for $k \le r$, with
$r \ge 1$. Let $a_1 \cdots a_r \in \{1,2,3\}^{r}$. One has
$\mu(R_{a_1 \cdots a_r i}) = (1/\beta)^{i} \mu(R_{a_1 \cdots a_r})$, which
implies similarly as in the proof of Lemma 10.8.8 that the interiors of the
sets $R_{a_1 \cdots a_r i}$, $i = 1, 2, 3$, are two-by-two disjoint in measure,
as well as the interiors of the sets $R_{a_1 \cdots a_r i}$, for
$i = 1, 2, 3$, and $a_1 \cdots a_r \in \{1,2,3\}^{r}$.

Hence for every $N$, and for every $r$, there exists a unique
$a_1 \cdots a_r$ such that $\delta(N)$ belongs to the interior of
$R_{a_1 \cdots a_r}$. Furthermore it is easily seen that if there exists $k$
such that $a_k \neq 1$, then there exists a coefficient $\varepsilon_i$ equal
to 1, with $i < r$, in the normal T-representation of $N$. This implies that if
$\delta(N) \in B^{m} R = \psi_1^{m}(R)$, then all the coefficients
$\varepsilon_i$ for $i < m$ are equal to 0 in its normal T-representation. □


10.8.5. Tiling and Tribonacci translation 


We are now able to prove that the covering of the plane $P_0$ stated in
Proposition 10.8.1, that is, $\bigcup_{\gamma \in L_0} (\mathcal{R} + \gamma)$,
is in fact a periodic tiling (up to sets of zero measure).


LEMMA 10.8.14. The sets $\operatorname{Int}\mathcal{R} + \gamma$, for
$\gamma \in L_0$, are disjoint, that is,
if $x, y \in \operatorname{Int} R$, with $x - y \in \mathbb{Z}^2$, then
$x = y$.


Proof. Let $x, y \in \operatorname{Int} R$ with $x - y \in \mathbb{Z}^2$. By
density of the sequence $(\delta(N))_{N \ge 0}$, there exists a point
$\delta(M)$ close enough to $x$ so that $\delta(M) + y - x$ is close enough to
$y$, and thus, still belongs to $R$.

Let us choose an integer $m$ large enough so that the coefficients
$\varepsilon_i$ in the normal T-representation of $M$ are equal to 0 for
$i > m$. One gets $M = \sum_{i=0}^{m} \varepsilon_i T_i$.
By density of the sequence $(\delta(N))$, there exists $N > M$ such that

\[ \|\delta(N) - (\delta(M) + y - x)\| < C \Bigl(\frac{1}{\beta}\Bigr)^{(m+2)/2}. \]

There exists $h \in \mathbb{Z}^2$ such that
$\delta(N) - (\delta(M) + y - x) = (N - M) \cdot (1/\beta, 1/\beta^2) - h$.

We thus can apply Lemma 10.8.13, and get that the normal T-representation
$\sum_{i \ge 0} \varepsilon'_i T_i$ of $N - M$ satisfies $\varepsilon'_i = 0$
for $i \le m+1$. This implies that $N$ admits as normal T-representation
$\sum_{i=0}^{m} \varepsilon_i T_i + \sum_{i \ge m+2} \varepsilon'_i T_i$, and
hence $\delta(N) - \delta(M) = \delta(N - M)$. Since
$\|(N - M) \cdot (1/\beta, 1/\beta^2) - h\| < C (1/\beta)^{(m+2)/2} < C$, it
follows from Lemma 10.8.10 that

\[ \delta(N - M) = (N - M) \cdot (1/\beta, 1/\beta^2) - h = \delta(N) - \delta(M) = \delta(N) - \delta(M) + x - y, \]

which implies $y = x$. □
REMARK 10.8.15. The domain $R$ is thus a fundamental domain of the torus
$\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$, that is,

\[ \mathbb{R}^2 = \bigcup_{v \in \mathbb{Z}^2} (R + v), \qquad P_0 = \bigcup_{\gamma \in L_0} (\mathcal{R} + \gamma), \]

both unions being disjoint up to sets of zero measure.


This tiling property has the following arithmetic formulation: the translation
by $(1/\beta, 1/\beta^2)$ in $\mathbb{R}^2/\mathbb{Z}^2 = \mathbb{T}^2$, which
is the quotient map of the exchange map $E$ defined in Lemma 10.8.8 with
respect to the lattice $L_0$, is coded by the Tribonacci word:


THEOREM 10.8.16. The Tribonacci word codes the orbit of the point 0 under
the action of the translation

\[ R_\beta\colon \mathbb{T}^2 \to \mathbb{T}^2, \quad x \mapsto x + (1/\beta, 1/\beta^2) \]

with respect to the partition of the fundamental domain $R$ of $\mathbb{T}^2$
by the sets $(R_1, R_2, R_3)$, that is,

\[ \forall N \in \mathbb{N},\ \forall i = 1, 2, 3, \quad u_N = i \iff R_\beta^{N}(0) \in R_i. \]


Proof. According to Proposition 10.8.12, for every $N$, there exists
$i = 1, 2, 3$ such that $\delta(N)$ belongs to the interior of $R_i$; hence
$R_\beta^{N}(0)$ (which is congruent modulo $\mathbb{Z}^2$ to $\delta(N)$) also
belongs modulo $\mathbb{Z}^2$ to $R_i$. Furthermore, such an integer $i$ is
unique according to Lemma 10.8.14. This implies that the coding of the orbit of
0 under $R_\beta$ is well-defined.

Let $E$ be the exchange of pieces introduced in Lemma 10.8.8. Let us prove
by induction on $N$ that $E^{N}(0) = \pi_0(f(u_0\cdots u_{N-1}))$. The
induction property holds for $N = 0$. Suppose that the induction property holds
for $N$. One has
$\pi_0(f(u_0\cdots u_{N-1})) \in \operatorname{Int}\mathcal{R}_{u_N}$. Hence
$E^{N+1}(0) = E(\pi_0(f(u_0\cdots u_{N-1}))) = \pi_0(f(u_0\cdots u_{N-1})) + \pi_0(e_{u_N}) = \pi_0(f(u_0\cdots u_N))$,
which ends the induction proof.

One thus deduces that for all $N \in \mathbb{N}$, for all $i = 1, 2, 3$,
$E^{N}(0) = \pi_0(f(u_0\cdots u_{N-1})) \in \mathcal{R}_i$ if and only if
$u_N = i$. In other words, we have proved that the Tribonacci word codes the
orbit of 0 under the action of the map $E$ with respect to the partition
$(\mathcal{R}_1, \mathcal{R}_2, \mathcal{R}_3)$, that is,

\[ \forall N \in \mathbb{N},\ \forall i = 1, 2, 3, \quad u_N = i \iff E^{N}(0) \in \mathcal{R}_i. \]

It remains to check that for all $N \in \mathbb{N}$, for all $i = 1, 2, 3$,
$E^{N}(0) \in \mathcal{R}_i$ if and only if $R_\beta^{N}(0) \in R_i$. By
definition, the coordinates of $E^{N}(0)$ in the basis $(e_3-e_1, e_3-e_2)$ are
equal to $\delta(N)$, which is congruent to $R_\beta^{N}(0)$ modulo
$\mathbb{Z}^2$, which ends the proof. □
10.8.6. A cut and project scheme


The aim of this section is to reformulate the previous results in terms of a
"cut and project scheme": Theorem 10.8.18 below states that the vertices of the
broken line are exactly the points of $\mathbb{Z}^3$ selected by shifting the
Rauzy fractal (considered as an "acceptance window") along the eigendirection
$v_\beta$.

A cut and project scheme consists of a direct product $\mathbb{R}^k \times H$,
$k \ge 1$, where $H$ is a locally compact Abelian group, and a lattice $D$ in
$\mathbb{R}^k \times H$, such that with respect to the natural projections
$p_0\colon \mathbb{R}^k \times H \to H$ and
$p_1\colon \mathbb{R}^k \times H \to \mathbb{R}^k$:


1. $p_0(D)$ is dense in $H$;
2. $p_1$ restricted to $D$ is one-to-one onto its image $p_1(D)$.


This cut and project scheme is denoted $(\mathbb{R}^k \times H, D)$.

A subset $\Gamma$ of $\mathbb{R}^k$ is a model set if there exists a cut and
project scheme $(\mathbb{R}^k \times H, D)$ and a relatively compact set (i.e.,
a set such that its closure is compact) $\Omega$ of $H$ with nonempty interior
such that

\[ \Gamma = \{p_1(P);\ P \in D,\ p_0(P) \in \Omega\}. \]

The set $\Omega$ is called the acceptance window of the cut and project scheme.

A Meyer set $S$ is a subset of some model set of $\mathbb{R}^k$, for some
$k \ge 1$, which is relatively dense, that is, there exists $R > 0$ such that
for all $P \in \mathbb{R}^k$, there exists $M \in S$ such that the ball of
radius $R$ located at $P$ contains $M$.


REMARK 10.8.17. The locally compact Abelian groups which usually occur in
the previous definition are either Euclidean or $p$-adic spaces.


Let $\pi_1$ denote the projection in $\mathbb{R}^3$ on the expanding line
generated by $v_\beta$ along the plane $P_0$. Let us recall that $\pi_0$
denotes the projection on the plane $P_0$ along the expanding line.


THEOREM 10.8.18. The subset
$\pi_1(\{f(u_0\cdots u_{N-1});\ N \in \mathbb{N}\})$ of the expanding eigenline
obtained by projecting under $\pi_1$ the vertices of the Tribonacci broken line
is a Meyer set associated with the cut and project scheme
$(\mathbb{R} \times \mathbb{R}^2, \mathbb{Z}^3)$, with acceptance window the
interior of the set $R$ of coordinates of the Rauzy fractal. In other words,

\[ \{f(u_0\cdots u_{N-1});\ N \in \mathbb{N}\} = \{P = (x,y,z) \in \mathbb{Z}^3;\ x+y+z \ge 0;\ \pi_0(P) \in \operatorname{Int}\mathcal{R}\}. \qquad (10.8.5) \]


Proof. Let $H = \mathbb{R}^2$, $D = \mathbb{Z}^3$, $k = 1$. The set
$H = \mathbb{R}^2$ is in one-to-one correspondence with the plane $P_0$,
whereas $\mathbb{R}$ is in one-to-one correspondence with the expanding
eigenline. Up to these two bijections, the natural projections become
respectively $\pi_0$ and $\pi_1$ and are easily seen to satisfy the required
conditions (the density has been proved in the proof of Proposition 10.8.1). It
remains to prove (10.8.5) to conclude.




According to Proposition 10.8.12, for every $N$,
$\pi_0(f(u_0\cdots u_{N-1})) \in \operatorname{Int}\mathcal{R}$.
Conversely, let $P = (x,y,z) \in \mathbb{Z}^3$ with $x+y+z \ge 0$ such that
$\pi_0(P) \in \operatorname{Int}\mathcal{R}$. Let $N = x+y+z$. According to
Lemma 10.8.2, there exists $\gamma \in L_0$ such that
$P = f(u_0\cdots u_{N-1}) + \gamma$. Since
$\pi_0(P) = \pi_0(f(u_0\cdots u_{N-1})) + \pi_0(\gamma) = \pi_0(f(u_0\cdots u_{N-1})) + \gamma$,
one gets $P = f(u_0\cdots u_{N-1})$, following Lemma 10.8.14. □


REMARK 10.8.19. Cut and project schemes are used to model quasicrystals
and to generate aperiodic tilings, as illustrated in Problem 10.8.6.


10.9. An application to simultaneous approximation 


We end this chapter with a section devoted to the study of some Diophantine
approximation properties of the vector of translation $(1/\beta, 1/\beta^2)$ of
the Tribonacci translation. In particular, the sequence of Tribonacci numbers
is shown to be the sequence of best approximations of this vector for the Rauzy
norm. Indeed, the vertices of the broken line of the form $f(\sigma^{n}(1))$,
$n \in \mathbb{N}$, provide (after projection) very good approximations of the
vector $(1/\beta, 1/\beta^2)$, and even the best approximations for the Rauzy
norm.


Let $v$ be a vector and $\|\ \|_0$ a norm in $\mathbb{R}^2$. The increasing
sequence of positive integers $(q_n)$ is said to be the sequence of best
approximations of the vector $v$ for the norm $\|\ \|_0$ if there exists a
sequence of vectors $(v_n)$ such that for each integer $n$ and for every
$w \in \mathbb{Z}^2$

\[ \|q_{n+1} v - v_{n+1}\|_0 < \|q_n v - w\|_0, \]

and for every $q < q_{n+1}$, $q \neq q_n$, and for every $w \in \mathbb{Z}^2$

\[ \|q_n v - v_n\|_0 < \|q v - w\|_0. \]


THEOREM 10.9.1. 1. The vector $(1/\beta, 1/\beta^2)$ is badly approximable by
the rational numbers, that is, there exists $K > 0$ such that for every
positive integer $N$,

\[ \sqrt{N}\; |||N \cdot (1/\beta, 1/\beta^2)||| > K. \]


2. For every norm, the sequence $(q_{n+1}/q_n)$ is bounded, where $(q_n)$
denotes the sequence of best approximations of the vector
$(1/\beta, 1/\beta^2)$.


3. The Tribonacci sequence $(T_n)$ is the sequence of best approximations of
the vector $(1/\beta, 1/\beta^2)$ for the Rauzy norm.


4. Furthermore

\[ \lim_{n \to \infty} \sqrt{T_n}\; |||T_n \cdot (1/\beta, 1/\beta^2)||| = \frac{1}{\sqrt{\beta^2 + 2\beta + 3}}. \]
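The value of the limit in Assertion 4 can be checked numerically once one uses $|||T_n \cdot (1/\beta, 1/\beta^2)||| = \|\delta(T_n)\| = |\alpha|^{n+4}$ for $n$ large and $|\alpha|^2 = 1/\beta$. The sketch below is illustrative: the function names are hypothetical and it assumes the convention $T_0 = 1$, $T_1 = 2$, $T_2 = 4$.

```python
import math


def dominant_root(iterations=200):
    """Real root beta of x^3 = x^2 + x + 1, located in [1, 2] by bisection."""
    lo, hi = 1.0, 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid**3 - mid**2 - mid - 1 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def tribonacci_number(n):
    """T_n with T_0, T_1, T_2 = 1, 2, 4 and T_{k+1} = T_k + T_{k-1} + T_{k-2}."""
    t = [1, 2, 4]
    while len(t) <= n:
        t.append(t[-1] + t[-2] + t[-3])
    return t[n]


def scaled_error(n, beta):
    """sqrt(T_n) * |alpha|^(n+4), with |alpha|^2 = 1 / beta."""
    return math.sqrt(tribonacci_number(n)) * beta ** (-(n + 4) / 2)


beta = dominant_root()
limit = 1 / math.sqrt(beta**2 + 2 * beta + 3)
for n in (5, 10, 20, 40):
    print(n, scaled_error(n, beta), limit)
```

The closed form comes from $T_n \sim \beta^{n+3}/(3\beta^2 - 2\beta - 1)$, which gives $T_n\,\beta^{-(n+4)} \to 1/(\beta^2 + 2\beta + 3)$.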


Proof.
1. Let $N \ge 1$ and let $n \ge 1$ such that $T_{n-1} \le N < T_n$. Hence the
normal T-representation of $N = \sum_{i \ge 0} \varepsilon_i T_i$ satisfies
$\varepsilon_{n-1} = 1$. According to Lemma 10.8.13, then
$|||N \cdot (1/\beta, 1/\beta^2)||| \ge C (1/\beta)^{n/2}$. There exist two
constants $C_1, C_2$ such that the Tribonacci sequence satisfies:
$\forall n \in \mathbb{N}$, $C_1 \beta^{n} \le T_n \le C_2 \beta^{n}$. Hence

\[ |||N \cdot (1/\beta, 1/\beta^2)||| \ge C\, \frac{\sqrt{C_1}}{\sqrt{T_n}} \ge C\, \frac{C_1}{\sqrt{C_2\, \beta N}}, \]

which ends the proof of the first assertion. Let us observe that such a
statement (up to the choice of the positive constant $K$) also holds for
every norm, by equivalence of the norms.


2. Let $\|\ \|_0$ be a norm in $\mathbb{R}^2$ and let $(q_n)$ be the sequence
of best approximations of $(1/\beta, 1/\beta^2)$ associated with this norm.

Let $n$ and $m$ be such that $T_m \le q_n < T_{m+1}$. For all $q < q_{n+1}$,
one has by definition
$|||q \cdot (1/\beta, 1/\beta^2)|||_0 \ge |||q_n \cdot (1/\beta, 1/\beta^2)|||_0$,
where $|||\ |||_0$ denotes the distance to the nearest integer point for the
norm $\|\ \|_0$. We just have seen (proof of Assertion 1) that
$|||q_n \cdot (1/\beta, 1/\beta^2)||| \ge K q_n^{-1/2}$. Hence

\[ |||q_n \cdot (1/\beta, 1/\beta^2)||| \ge K\, T_{m+1}^{-1/2}. \]

On the other hand, one has for $l$ large enough, according to Lemma 10.8.10,
that $|||T_{m+1+l} \cdot (1/\beta, 1/\beta^2)||| = \|\delta(T_{m+1+l})\|$.
Since the norms $\|\ \|_0$ and $\|\ \|$ are equivalent, then
$|||T_{m+1+l} \cdot (1/\beta, 1/\beta^2)|||_0 = \|\delta(T_{m+1+l})\|_0$ also
holds for $l$ large enough, still following Lemma 10.8.10. By equivalence of
the norms, there exists a constant $C_3$ such that for $l$ large enough

\[ |||T_{m+1+l} \cdot (1/\beta, 1/\beta^2)|||_0 = \|\delta(T_{m+1+l})\|_0 \le C_3 \|\delta(T_{m+1+l})\|
 = C_3 |\alpha|^{m+l+5} \le C_3 \sqrt{C_2}\, |\alpha|^{l+4}\, T_{m+1}^{-1/2}. \]

Hence there exists $l_0$ large enough such that

\[ |||T_{m+1+l_0} \cdot (1/\beta, 1/\beta^2)|||_0 < |||q_n \cdot (1/\beta, 1/\beta^2)|||_0, \]

which implies that $T_{m+1+l_0} \ge q_{n+1}$. Hence one has

\[ \frac{q_{n+1}}{q_n} \le \frac{T_{m+1+l_0}}{T_m} \le \frac{C_2}{C_1}\, \beta^{l_0+1}. \]


3. The sequence (6(T;))n which satisfies 6(T;,) = |a|"** is a decreasing se- 

quence. Furthermore, for n > 8, ||5(T;,)|| < Ja|’* < C, which implies that 

\||Tn- (1/8, 1/67)||| = ||6(T;)||. One checks by considering a finite number 

of cases that when n < Tg, then the properties of good approximation 
hold for the Tribonacci sequence. 


Let us assume from now on that N > Tg. We want to prove that if 
N <Tn41 and N # Tp, then for every v € Z? 


[|5(Tn I < IN (1/8, 1/87) — oI]. 
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Since ||N - (1/6,1/?) — v|| < C implies that N - (1/6,1/87) —v = 6(N), 
it is sufficient to check that ||6(T;,)|| < ||6(V)||- 

Let n € N and let N < Ty41, N 4 Ty, with normal T-representation 
N = Mo<icp ili Let 19 = min{2| ¢; 4 0} (49 4 n since N # T,,); hence 
N= anes One has 


(CN) =| SO ecBizl| > |B zt eign Bet z||—|| So ecBizl|. 


io<i<n io $2<i<n 


Let us prove that ||B’° z+ €;,41.B%*+2|| — || > €,B*z|| > |a|!t*. 


in +2<i<n 
e Assume first that ¢;,41 =0. Then 


n 


| So eeBral| < [|BO* ) ei,4095B*z|| < 
>in +2 i>0 


joo**|lz|| _ Jal‘e*® 


ee ee ee: 


4 1—Ja|?—-Jal> 
Hence |[3(N)|| > jal’ (LSet) 5 0, 
• Assume now that $\varepsilon_{i_0+1} = 1$, and thus $\varepsilon_{i_0+2} = 0$. One has


Jarl? 


CN) I] 2 lal (Iz + Bell — y— rE 


Ve 


2 3 
It remains to check that |a|* a ,||2+Bz||-< 


Jol" 12 
1=Ta3| rae > la 


to conclude. 
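If $n - i_0 \ge 8$, then
$\|\delta(N)\| > |\alpha|^{i_0+12} \ge |\alpha|^{n+4} = \|\delta(T_n)\|$.
In the case where $n - i_0 \le 7$, one checks by considering a finite number of
cases that $\| \sum_{i=0}^{m} \varepsilon_i B^i z \| > |\alpha|^{m+4}$, for
$0 \le m \le 7$, which implies

$$\|\delta(N)\| = |\alpha|^{i_0} \Big\| \sum_{i=0}^{n-i_0} \varepsilon_{i_0+i} B^i z \Big\|
> |\alpha|^{i_0} |\alpha|^{n-i_0+4} = \|\delta(T_n)\|.$$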


4. Let $K_0$ be defined as the smallest real number such that there exist
infinitely many integers $N$ satisfying

$$\sqrt{N}\, |||N \cdot (1/\beta, 1/\beta^2)||| < K_0.$$

From Assertion 1, one deduces that $K_0$ is finite. In fact, $K_0$ is the
smallest real number such that there exist infinitely many integers $n$
satisfying $\sqrt{T_n}\, \|\delta(T_n)\| < K_0$, since $(T_n)$ is the sequence
of best approximations of $(1/\beta, 1/\beta^2)$. The following limit exists
and equals:

$$\lim_{n \to \infty} \sqrt{T_n}\, \|\delta(T_n)\| = \frac{1}{\sqrt{\beta^2 + 2\beta + 3}},$$

hence the result. □
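
The limit of $\sqrt{T_n}\,\|\delta(T_n)\|$ can be checked numerically. The
following Python sketch is only an illustration under stated assumptions: it
takes the normalisation $T_0 = 1$, $T_1 = 2$, $T_2 = 4$ for the Tribonacci
numbers, uses $\|\delta(T_n)\| = |\alpha|^{n+4}$ from item 3 together with
$|\alpha|^2 = 1/\beta$, and compares with the candidate value
$1/\sqrt{\beta^2 + 2\beta + 3}$. The function names are hypothetical.

```python
import math

def tribonacci_root(iterations=200):
    """Real root beta of X^3 = X^2 + X + 1, via beta = 1 + 1/beta + 1/beta^2."""
    b = 2.0
    for _ in range(iterations):
        b = 1.0 + 1.0 / b + 1.0 / (b * b)
    return b

def tribonacci_numbers(count):
    """Assumed normalisation: T_0 = 1, T_1 = 2, T_2 = 4, T_{n+3} = T_{n+2} + T_{n+1} + T_n."""
    T = [1, 2, 4]
    while len(T) < count:
        T.append(T[-1] + T[-2] + T[-3])
    return T[:count]

def scaled_norms(count):
    """sqrt(T_n) * |alpha|^(n+4) for n < count, using |alpha|^2 = 1/beta."""
    beta = tribonacci_root()
    return [math.sqrt(t) * beta ** (-(n + 4) / 2)
            for n, t in enumerate(tribonacci_numbers(count))]

beta = tribonacci_root()
candidate_limit = 1.0 / math.sqrt(beta**2 + 2 * beta + 3)
print(scaled_norms(40)[-1], candidate_limit)
```

The two printed values should agree to many decimal places, since the
contribution of the complex roots to $T_n$ decays geometrically.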
Problems


Section 10.4 
10.4.1 (gcd). Let $q \ge 2$ be an integer. Prove that for all integers $m, n \ge 1$
$$\gcd(q^m - 1, q^n - 1) = q^{\gcd(m,n)} - 1.$$
(Hint. If $m > n$ and $m = an + b$ with $b \in [0, n-1]$, prove that an
integer divides both $q^m - 1$ and $q^n - 1$ if and only if it divides both
$q^n - 1$ and $q^b - 1$. Then use the Euclidean algorithm to compute gcd's.)
Deduce that $q^m - 1$ divides $q^n - 1$ if and only if $m$ divides $n$.
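
Before attempting the proof, both statements of Problem 10.4.1 can be tested
on small values. The Python sketch below is a brute-force check, not a proof;
the function names and the test ranges are arbitrary choices made for this
illustration.

```python
from math import gcd

def check_gcd_identity(q_max=6, exp_max=12):
    """True if gcd(q^m - 1, q^n - 1) == q^gcd(m, n) - 1 on the whole test range."""
    return all(
        gcd(q**m - 1, q**n - 1) == q**gcd(m, n) - 1
        for q in range(2, q_max + 1)
        for m in range(1, exp_max + 1)
        for n in range(1, exp_max + 1)
    )

def check_divisibility(q_max=6, exp_max=12):
    """True if (q^m - 1 divides q^n - 1) <=> (m divides n) on the whole test range."""
    return all(
        ((q**n - 1) % (q**m - 1) == 0) == (n % m == 0)
        for q in range(2, q_max + 1)
        for m in range(1, exp_max + 1)
        for n in range(1, exp_max + 1)
    )

print(check_gcd_identity(), check_divisibility())
```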


Section 10.5 


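10.5.1 (Möbius function). Define the Möbius function $\mu$ on the integers $\ge 1$ by

$$\mu(n) := \begin{cases}
1 & \text{if } n = 1,\\
0 & \text{if there exists } k \ge 2 \text{ such that } k^2 \text{ divides } n,\\
(-1)^r & \text{if } n = p_1 p_2 \cdots p_r, \text{ where the } p_i\text{'s are distinct primes.}
\end{cases}$$

a. Prove that, for every $n \ge 1$,
$$\sum_{d \mid n} \mu(d) = \begin{cases} 1 & \text{if } n = 1,\\ 0 & \text{if } n \ge 2. \end{cases}$$

(Hint. Note that, if $n = \prod_{1 \le i \le r} p_i^{a_i}$ is the decomposition
of $n \ge 2$ into primes, then

$$\sum_{d \mid n} \mu(d) = \sum_{d \mid p_1 \cdots p_r} \mu(d) = \sum_{0 \le j \le r} \binom{r}{j} (-1)^j.)$$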


b. Prove the Möbius inversion formula: if $f$ and $g$ are two maps
defined on the positive integers, then

$$\forall n \ge 1,\ g(n) = \sum_{d \mid n} f(d) \implies \forall n \ge 1,\ f(n) = \sum_{d \mid n} \mu(d)\, g(n/d).$$


c. Prove that, if $F$ and $G$ are two maps defined on the real numbers,
then (summing over $n \le x$ means that the summation is over the
integers $n$ such that $n \le x$)

$$\forall x > 0,\ G(x) = \sum_{n \le x} F\Big(\frac{x}{n}\Big) \implies \forall x > 0,\ F(x) = \sum_{n \le x} \mu(n)\, G\Big(\frac{x}{n}\Big).$$




d. Define a square-free number as an integer that is not divisible by
any square of an integer $\ge 2$. Prove that for each integer $n \ge 1$
there exists a unique square-free number $q$ and a unique integer $a$
such that $n = a^2 q$.

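e. Let $\lfloor x \rfloor$ be the integral part of the real $x$. Let $Q(x)$ be
the number of square-free numbers at most equal to $x$. Prove that, for each
real number $x > 0$,

$$Q(x) = \sum_{n \le \sqrt{x}} \mu(n) \Big\lfloor \frac{x}{n^2} \Big\rfloor.$$

(Hint. Note that
$\lfloor x \rfloor = \sum_{q \text{ squarefree},\ q \le x} \lfloor \sqrt{x/q} \rfloor$.
Deduce that

$$\lfloor x \rfloor = \sum_{n \le \sqrt{x}} Q\Big(\frac{x}{n^2}\Big)$$

and use Part b. above.)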
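f. Prove that the density of the square-free numbers exists and is
equal to $\sum_{n \ge 1} \mu(n)/n^2$.
(Hint. Write
$$Q(x) = \sum_{n \le \sqrt{x}} \mu(n) \Big\lfloor \frac{x}{n^2} \Big\rfloor
= x \sum_{n \le \sqrt{x}} \frac{\mu(n)}{n^2} + O(\sqrt{x})
= x \sum_{n \ge 1} \frac{\mu(n)}{n^2} + O(\sqrt{x}).)$$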

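g. Prove that $\sum_{n \ge 1} \mu(n)/n^2 = 6/\pi^2$.
(Hint. Write

$$\Big( \sum_{m \ge 1} \frac{\mu(m)}{m^2} \Big) \Big( \sum_{n \ge 1} \frac{1}{n^2} \Big)
= \sum_{k \ge 1} \frac{1}{k^2} \sum_{m \mid k} \mu(m)$$

and use that $\sum_{n \ge 1} 1/n^2 = \pi^2/6$.)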
h. Let $\psi(n)$ denote the number of primitive words of length $n$ over
an alphabet of size $k$. Prove that
$$k^n = \sum_{d \mid n} \psi(d).$$
(Hint. Every word $w$ can be written in a unique way as $w = v^d$,
where $v$ is a primitive word, and $d$ is an integer $\ge 1$. Of course $d$
must divide the length of $w$.)
Using the inversion formula b. above deduce that

$$\psi(n) = \sum_{d \mid n} \mu(d)\, k^{n/d}.$$
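
Several identities of Problem 10.5.1 are easy to test by machine. The Python
sketch below is a brute-force illustration, not a solution: it computes the
Möbius function by trial division, evaluates the divisor sum of Part a, counts
primitive words both by the inversion formula of Part h and by enumeration,
and evaluates the square-free count through the sum
$\sum_{n \le \sqrt{x}} \mu(n) \lfloor x/n^2 \rfloor$. All function names are
hypothetical and chosen for this example only.

```python
from itertools import product
from math import isqrt

def mobius(n):
    """Moebius function of n >= 1, by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:      # p^2 divides the original n
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def primitive_count(n, k):
    """Number of primitive words of length n over k letters, by Moebius inversion."""
    return sum(mobius(d) * k ** (n // d) for d in divisors(n))

def is_primitive(word):
    """A word is primitive if it is not a proper power v^e with e >= 2."""
    n = len(word)
    return not any(n % d == 0 and word == word[:d] * (n // d) for d in range(1, n))

def brute_force_primitive_count(n, k):
    return sum(is_primitive(w) for w in product(range(k), repeat=n))

def squarefree_count(x):
    """Sum of mu(n) * floor(x / n^2) over n <= sqrt(x), for an integer x >= 1."""
    return sum(mobius(n) * (x // (n * n)) for n in range(1, isqrt(x) + 1))

print([sum(mobius(d) for d in divisors(n)) for n in range(1, 8)])
print(primitive_count(6, 2), brute_force_primitive_count(6, 2))
print(squarefree_count(10**6) / 10**6)   # compare with 6 / pi^2
```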
10.5.2 (Algebraicity). Prove that, if the formal power series $\sum a_n X^n$
has integral coefficients and is algebraic over the field $\mathbb{Q}(X)$,
then the formal power series $\sum (a_n \bmod p) X^n$ is algebraic over
$\mathbb{F}_p(X)$.


Section 10.7 


10.7.1 (Complexity function of the Tribonacci word). First observe that the
letters 2 and 3 are only followed or preceded by the letter 1 in the
Tribonacci word $u$. Second, prove that every factor $w$ distinct from
the empty word $\varepsilon$ of the Tribonacci word $u$ can be uniquely
written as follows: $w = r_1 \tau(v) r_2$, where $v$ is a factor of $u$,
$r_1 \in \{\varepsilon, 2, 3\}$, and $r_2 = 1$ if the last letter of $w$ is 1,
and $r_2$ is the empty word otherwise. Deduce from this the following
combinatorial properties:

a. Prove that the Tribonacci word u is not ultimately periodic, that 
is, periodic from some rank on. 

b. A factor $w$ of a word $z$ is said to be right special if there exist two
distinct letters $a$ and $b$ such that both $wa$ and $wb$ are factors of $z$.
Prove that the Tribonacci word admits exactly one right special factor of
each length.

c. The complexity function of an infinite word s is defined as the 
function P(s,n) which counts the number of distinct factors of 
length n of s. Deduce that the complexity function P(u,n) of the 
Tribonacci word satisfies: Vn € N, P(u,n) = 2n +1. 

d. Prove that the Tribonacci word is uniformly recurrent, i.e., every 
factor appears infinitely often with bounded gaps. 

e. Use the same method to prove that the Fibonacci word (defined 
in Section 10.1.4) admits exactly n +1 factors of length n. 

f. Prove that the topological entropy (as defined in Section 1.8.3) of 
the set of factors of the Tribonacci word as well as the topological 
entropy of the set of factors of the Fibonacci word are equal to 0. 
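
The counting statements of Parts b, c and e can be explored experimentally on
long prefixes. The Python sketch below is an illustration only: it assumes
the Tribonacci morphism $1 \mapsto 12$, $2 \mapsto 13$, $3 \mapsto 1$ and the
Fibonacci morphism $0 \mapsto 01$, $1 \mapsto 0$, and it counts factors of a
finite prefix, which gives the true values only when the prefix is long enough
to contain every factor of the lengths considered. The helper names are
hypothetical.

```python
TRIBONACCI = {"1": "12", "2": "13", "3": "1"}
FIBONACCI = {"0": "01", "1": "0"}

def fixed_point_prefix(morphism, first_letter, length):
    """Prefix of the fixed point of `morphism` that starts with `first_letter`."""
    word = first_letter
    while len(word) < length:
        word = "".join(morphism[c] for c in word)
    return word[:length]

def factors(word, n):
    """Set of factors of length n occurring in the finite word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def right_special(word, n):
    """Factors of length n seen with at least two distinct right extensions."""
    longer = factors(word, n + 1)
    return {w for w in factors(word, n)
            if len({v[-1] for v in longer if v[:n] == w}) >= 2}

u = fixed_point_prefix(TRIBONACCI, "1", 5000)
f = fixed_point_prefix(FIBONACCI, "0", 2000)
print([len(factors(u, n)) for n in range(1, 8)])        # compare with 2n + 1
print([len(right_special(u, n)) for n in range(1, 8)])  # compare with 1
print([len(factors(f, n)) for n in range(1, 8)])        # compare with n + 1
```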

10.7.2 Prove that the Tribonacci word is not an automatic sequence, by
considering the probabilities of occurrence of the letters. Deduce from Problem
1.8.1 and Section 1.8.6 the values of the probabilities of occurrence of
the factors of length 2 of the Tribonacci word.

10.7.3 (Dumont-Thomas numeration system on words). The aim of this problem
is to extend the statement of Lemma 10.7.2 to more general morphisms
following Dumont and Thomas 1989, 1993, and Rauzy 1990.
Let $\tau$ be a morphism on the alphabet $A$ satisfying the assumptions of
Proposition 10.1.3. The prefix automaton of $\tau$ is defined as follows: its
states are the letters of $A$; there is an edge from $a$ to $b$ labeled by
$p \in A^*$ if $\tau(a) = pbs$, where $s \in A^*$. For instance the prefix
automaton of the Fibonacci morphism $0 \mapsto 01$, $1 \mapsto 0$ is the
Golden mean automaton as defined in Example 1.3.5, where the label $a$ has to
be replaced by $\varepsilon$ and $b$ by 0. Let $v$ be the fixed point of $\tau$
having $a$ as first letter, in the sense of Remark 10.1.4. Prove that every
finite prefix of $v$ can be uniquely expanded as

$$\tau^n(p_n)\, \tau^{n-1}(p_{n-1}) \cdots p_0,$$

where $p_n \ne \varepsilon$, and $p_n \cdots p_0$ is the sequence of labels of
a path in the prefix automaton starting from the letter $a$. Conversely, prove
that any such sequence of labels generates a finite prefix of $v$.
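
The expansion of Problem 10.7.3 can be computed greedily, level by level. The
Python sketch below is one possible way to do it and is not taken from the
cited papers: at each level it reads the longest proper prefix $p$ of the
image of the current letter whose iterated image still fits in the remaining
length, then moves to the letter that follows $p$. It assumes that
$\tau(a)$ starts with $a$ and has length at least 2; the function names are
hypothetical.

```python
TRIBONACCI = {"1": "12", "2": "13", "3": "1"}
FIBONACCI = {"0": "01", "1": "0"}

def iterate(morphism, word, n):
    """Apply the morphism n times to the word."""
    for _ in range(n):
        word = "".join(morphism[c] for c in word)
    return word

def dumont_thomas(morphism, a, N):
    """Labels [p_n, ..., p_0] for the prefix of length N >= 1 of the fixed point starting with a."""
    n = 0
    while len(iterate(morphism, a, n + 1)) <= N:   # smallest n with |tau^(n+1)(a)| > N
        n += 1
    labels, state, remaining = [], a, N
    for level in range(n, -1, -1):
        image = morphism[state]
        j = 0
        # longest proper prefix image[:j] whose level-th image fits in `remaining`
        while j + 1 < len(image) and len(iterate(morphism, image[:j + 1], level)) <= remaining:
            j += 1
        labels.append(image[:j])
        remaining -= len(iterate(morphism, image[:j], level))
        state = image[j]                           # follow the edge labeled image[:j]
    assert remaining == 0
    return labels

def expand(morphism, labels):
    """Rebuild tau^n(p_n) tau^(n-1)(p_(n-1)) ... p_0 from the labels."""
    n = len(labels) - 1
    return "".join(iterate(morphism, p, n - k) for k, p in enumerate(labels))

labels = dumont_thomas(TRIBONACCI, "1", 10)
print(labels, expand(TRIBONACCI, labels))
```

For the Fibonacci morphism the labels produced are either the empty word or
the letter 0, in agreement with the relabeling of the Golden mean automaton.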

10.7.4 (Statistics on letters for Pisot morphisms). A morphism
$\tau : A^* \to A^*$ is said to be of Pisot type if first it satisfies the
assumptions of Proposition 10.1.3, and second, the eigenvalues of its incidence
matrix satisfy the following: there exists a dominant eigenvalue $\alpha$ such
that for every other eigenvalue $\lambda$, one gets
$\alpha > 1 > |\lambda| > 0$. Deduce from Problem 10.7.3 that the results of
Proposition 10.7.4 hold for any fixed point of a Pisot type morphism.

10.7.5 (Uniform balance). An infinite word $v \in A^{\mathbb{N}}$ is said to be
uniformly balanced if there exists $C > 0$ such that for any two factors
$w$, $w'$ of the same length of $v$, and for any letter $i \in A$,

$$\big|\, |w|_i - |w'|_i \,\big| \le C.$$

An infinite word $v \in A^{\mathbb{N}}$ is said to have bounded remainder
letters if first, for every letter $i$, its probability of occurrence $\pi(i)$
in $v$ exists, and second, there exists $C'$ such that

$$\forall N,\ \big|\, |v_0 v_1 \cdots v_{N-1}|_i - \pi(i) N \,\big| \le C'.$$


Prove that a sequence is uniformly balanced if and only if it has bounded 
remainder letters. Deduce from Problem 10.7.4 that a fixed point of a 
Pisot morphism is uniformly balanced. 
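
Both notions of Problem 10.7.5 can be measured on a finite prefix of the
Tribonacci word. The Python sketch below is an experiment, not a proof: it
assumes the Tribonacci morphism $1 \mapsto 12$, $2 \mapsto 13$, $3 \mapsto 1$
and takes the letter frequencies to be $1/\beta$, $1/\beta^2$, $1/\beta^3$,
where $\beta$ is the real root of $X^3 = X^2 + X + 1$; these frequencies and
the function names are assumptions of this example.

```python
def tribonacci_root(iterations=200):
    """Real root beta of X^3 = X^2 + X + 1, via beta = 1 + 1/beta + 1/beta^2."""
    b = 2.0
    for _ in range(iterations):
        b = 1.0 + 1.0 / b + 1.0 / (b * b)
    return b

def tribonacci_word(length):
    morphism = {"1": "12", "2": "13", "3": "1"}
    word = "1"
    while len(word) < length:
        word = "".join(morphism[c] for c in word)
    return word[:length]

def balance_constant(word, max_len):
    """Max over n <= max_len, letters i, factors w, w' of length n of ||w|_i - |w'|_i|."""
    worst = 0
    for letter in set(word):
        prefix_counts = [0]
        for c in word:
            prefix_counts.append(prefix_counts[-1] + (c == letter))
        for n in range(1, max_len + 1):
            counts = [prefix_counts[k + n] - prefix_counts[k]
                      for k in range(len(word) - n + 1)]
            worst = max(worst, max(counts) - min(counts))
    return worst

def max_discrepancy(word, frequencies):
    """Max over prefixes of length N and letters i of | |prefix|_i - frequency(i) * N |."""
    worst = 0.0
    for letter, freq in frequencies.items():
        count = 0
        for N, c in enumerate(word, start=1):
            count += (c == letter)
            worst = max(worst, abs(count - freq * N))
    return worst

beta = tribonacci_root()
u = tribonacci_word(3000)
frequencies = {"1": 1 / beta, "2": 1 / beta**2, "3": 1 / beta**3}
print(balance_constant(u, 40), max_discrepancy(u, frequencies))
```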

For more results on the balance properties of fixed points of morphisms, 
see Adamczewski 2003. 


Section 10.8 


10.8.1 (Bounded remainder sets). A measurable set $X$ with respect to the
Lebesgue measure $\mu$ in $\mathbb{T}^2$ is said to be a bounded remainder set
for the translation $R_\beta$ if there exists $C > 0$ such that

$$\forall N,\ \big|\, \mathrm{Card}\{i;\ 0 \le i < N,\ R_\beta^i(0) \in X\} - N \mu(X) \,\big| \le C.$$

Deduce from Proposition 10.7.4 and Theorem 10.8.16 that the sets
$\mathcal{R}_i$, $i = 1, 2, 3$, are bounded remainder sets.
Bounded remainder sets have been widely studied, see for instance
Ferenczi 1992.

10.8.2 (Generalized Rauzy fractal and self-similarity). Let $\tau$ be a
primitive morphism of Pisot type over the alphabet $\{1, \dots, d\}$. Define
similarly as in Section 10.8.1 a generalized Rauzy fractal
$\mathcal{R}(\tau)$ as well as its division into the pieces
$\mathcal{R}_i(\tau)$, $i = 1, \dots, d$. Prove that the statement of
Proposition 10.8.1 still holds.
Prove that, for $i = 1, \dots, d$,

$$M_\tau^{-1}(\mathcal{R}_i(\tau)) = \bigcup_{1 \le j \le d}\ \bigcup_{p, s,\ \tau(j) = p i s} \big( \mathcal{R}_j(\tau) + M_\tau^{-1}(\pi \circ f(p)) \big).$$

(Hint. Apply $M_\tau$ to this equality.) This equality means that the pieces
$\mathcal{R}_i(\tau)$ of the Rauzy fractal are self-similar (and more precisely
self-affine), that is, they can be inflated under the expanding action of
$M_\tau^{-1}$, the image of each piece $\mathcal{R}_i(\tau)$,
$i = 1, \dots, d$, being redivided into translates of the sets
$\mathcal{R}_j(\tau)$. This result is a generalization of the statement
of Lemma 10.8.7. This self-similarity property is considered for instance
in Holton and Zamboni 1998, Arnoux and Ito 2001, Sirvent and Wang
2002.


10.8.3 Deduce from the proof of Proposition 10.8.1 an upper bound on the
diameter of the Rauzy fractal $\mathcal{R}$.

10.8.4 ($\beta$-numeration). Prove that every positive real number $x$ can be
expanded as

$$x = \sum_{i \le d} \varepsilon_i \beta^i, \text{ where } d \in \mathbb{Z},\ \forall i,\ \varepsilon_i \in \{0, 1\},\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0, \qquad (10.10.1)$$

by introducing the $\beta$-transformation map
$T_\beta : [0,1[ \to [0,1[$, $x \mapsto \{\beta x\}$;
such an expansion (with the above admissibility conditions (10.10.1)) is
called a $\beta$-expansion; is there unicity of such an expansion? For more
details on the $\beta$-numeration, see for instance Lothaire 2002.
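
The digits produced by the $\beta$-transformation can be computed directly.
The Python sketch below is a floating-point illustration for $x \in [0,1[$
and the Tribonacci root $\beta$; it assumes the digit at each step is
$\lfloor \beta x \rfloor$ followed by $x \leftarrow \{\beta x\}$, and rounding
errors limit it to a moderate number of digits. The function names are
hypothetical.

```python
def tribonacci_root(iterations=200):
    """Real root beta of X^3 = X^2 + X + 1, via beta = 1 + 1/beta + 1/beta^2."""
    b = 2.0
    for _ in range(iterations):
        b = 1.0 + 1.0 / b + 1.0 / (b * b)
    return b

def beta_digits(x, beta, count):
    """First `count` digits d_1, d_2, ... with x = sum d_k beta^(-k), for x in [0, 1)."""
    digits = []
    for _ in range(count):
        y = beta * x
        d = int(y)        # floor(beta * x); in {0, 1} because 1 < beta < 2
        digits.append(d)
        x = y - d         # x <- T_beta(x) = fractional part of beta * x
    return digits

def evaluate(digits, beta):
    """Value of the finite sum of d_k * beta^(-k), k starting at 1."""
    return sum(d * beta ** (-(k + 1)) for k, d in enumerate(digits))

beta = tribonacci_root()
digits = beta_digits(0.3, beta, 30)
print(digits, evaluate(digits, beta))
```

On the sample inputs tried, the digit strings contain no three consecutive 1's,
as the admissibility condition (10.10.1) requires.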
10.8.5 (F-property). The aim of this problem is to prove that the set
$\mathrm{Fin}(\beta)$ of positive real numbers having a finite
$\beta$-expansion (see Problem 10.8.4) coincides with the set
$(\mathbb{Z}[\beta^{-1}])_+$ of positive polynomials in $1/\beta$ with
integer coefficients. This property is called the F-property and has been
introduced in Frougny and Solomyak 1992, see also Akiyama 1999.
a. Let $\mathbb{Z}_+[\beta^{-1}]$ denote the set of polynomials in $1/\beta$
with non-negative integer coefficients. Prove that $\mathrm{Fin}(\beta)$ is
included in $\mathbb{Z}_+[\beta^{-1}]$. (Use the fact that
$1 = \beta^{-1} + \beta^{-2} + \beta^{-3}$.)
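b. The aim of this question is to prove that
$\mathbb{Z}_+[\beta^{-1}] = (\mathbb{Z}[\beta^{-1}])_+$. Let
$x \in (\mathbb{Z}[\beta^{-1}])_+$. Prove that there exist $s \in \mathbb{N}$
and $(x_0, x_1, x_2) \in \mathbb{Z}^3$ such that

$$x = \frac{1}{\beta^s} (x_0 + x_1 \beta + x_2 \beta^2) = \frac{1}{\beta^s} \langle (x_0, x_1, x_2), v_\beta \rangle$$

(we consider here the Euclidean scalar product in $\mathbb{R}^3$). Deduce
that for all $n \in \mathbb{N}$,
$x = \frac{1}{\beta^{s+n}} \langle {}^t\!M^n (x_0, x_1, x_2), v_\beta \rangle$.
Apply the Perron-Frobenius theorem to conclude.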

c. The aim of this question is to prove that
$\mathbb{Z}_+[\beta^{-1}] \cap [0, 1[$ is included in $\mathrm{Fin}(\beta)$.
For that purpose we introduce an algorithm consisting in the repetition of the
action of two steps $A_1$ and $A_2$, that transforms a finite
$\beta$-representation of $x$ (with digits not necessarily satisfying
the admissibility conditions (10.10.1)) into the $\beta$-expansion of $x$.
Let $x = \sum_{i=1}^{d} x_i \beta^{-i} \in \mathbb{Z}_+[\beta^{-1}] \cap [0, 1[$,
where $\forall i$, $x_i \in \mathbb{N}$.

Step $A_1$. Assume that there exists an integer $k \ge 1$ such that
$x_{k+1} \ge 1$, $x_{k+2} \ge 1$, and $x_{k+3} \ge 1$. Let $A_1$ be the
algorithm which maps $(x_i)_{i \ge 1}$ (where we set $x_i = 0$ for $i > d$) to

$$(x'_i) = x_1 \cdots (x_k + 1)(x_{k+1} - 1)(x_{k+2} - 1)(x_{k+3} - 1)\, x_{k+4} \cdots.$$

Prove that $\sum x'_i < \sum x_i$ and that
$\sum x'_i \beta^{-i} = \sum x_i \beta^{-i}$.


Step $A_2$. Assume that there exists an index $k$ such that $x_k \ge 2$.
Prove that $k \ge 2$. Let $l$ be the smallest integer such that $x_l \ge 2$.
Let $A_2$ be the algorithm which sends $(x_i)_{i \ge 1}$ to

$$(x'_i) = x_1 \cdots (x_l - 1)(x_{l+1} + 1)(x_{l+2} + 1)(x_{l+3} + 1) \cdots.$$

Let $k \ge 1$ be the largest integer such that $k < l$ and $x'_{k+1} \ge 1$,
$x'_{k+2} \ge 1$, $x'_{k+3} \ge 1$. Then, the algorithm $A_1$ sends $(x'_i)$ to

$$(x''_i) = x'_1 \cdots (x'_k + 1)(x'_{k+1} - 1)(x'_{k+2} - 1)(x'_{k+3} - 1) \cdots.$$

The sequence $(x''_i)$ is the image of $(x_i)$ under the action of $A_2$.

Prove that $\sum x''_i = \sum x_i$ and that
$\sum x''_i \beta^{-i} = \sum x_i \beta^{-i}$.
We now apply repeatedly steps $A_1$ and $A_2$ to $x$, defining a sequence
$(x^{(j)})$ such that for all $j$, all but finitely many values of $x^{(j)}$
are zero. If for some value $j_0$, $x^{(j_0)}$ satisfies the admissibility
conditions (10.10.1), then we set $x^{(j)} = x^{(j_0)}$ for $j \ge j_0$, and we
apply no step anymore.
Prove that for $j$ large enough, step $A_1$ cannot be performed any
more.
Let us assume that $A_2$ can be applied indefinitely. Let $J$ be such
that for $j \ge J$, step $A_1$ cannot be performed any more. Let $l_j$
denote, for $j \ge J$, the smallest index $l$ such that $x_l^{(j)} \ge 2$.
Prove that $(l_j)$ tends to infinity and that the sequence $(x^{(j)})$ is
convergent. Find a contradiction.
d. Conclude.
We have followed here the proof of Frougny and Solomyak 1992.
10.8.6 (A tiling of the line). We have seen that
$\pi(\{f(u_0 \cdots u_{N-1});\ N \in \mathbb{N}\})$ is a Meyer set in
Section 10.8.6. We associate here in a natural way to this set of points a
tiling of the line.
Prove that $\pi'(\{f(u_0 \cdots u_{N-1});\ N \in \mathbb{N}\})$ defines a
tiling $\mathcal{T}$ of the half-line generated by $v_\beta$ in the positive
octant $\{(x, y, z);\ x, y, z \ge 0\}$ by segments of three distinct lengths,
say $l_1$, $l_2$, $l_3$. In other words, prove that the distance between two
successive points of $\pi'(\{f(u_0 \cdots u_{N-1});\ N \in \mathbb{N}\})$
(with respect to the orientation on the half-line provided by $v_\beta$)
equals either $l_1$, $l_2$ or $l_3$.
Prove that under a suitable choice of a unit vector on the half-line,
$\pi'(\{f(u_0 \cdots u_{N-1});\ N \in \mathbb{N}\})$ is in one-to-one
correspondence with the set of $\beta$-integers

$$\mathbb{Z}_\beta^+ = \Big\{ \sum_{i=0}^{d} \varepsilon_i \beta^i;\ d \in \mathbb{N},\ \forall i,\ \varepsilon_i \in \{0, 1\},\ \varepsilon_i \varepsilon_{i+1} \varepsilon_{i+2} = 0 \Big\}.$$

Let $(t_n)_{n \ge 0}$ denote the set of elements of
$\pi'(\{f(u_0 \cdots u_{N-1});\ N \in \mathbb{N}\})$ ordered in increasing
order (still with respect to the orientation on the half-line provided by
$v_\beta$). One can code the tiling $\mathcal{T}$ as follows: for $n \ge 0$,
for $i = 1, 2, 3$, $v_n = i$ if and only if $t_{n+1} - t_n = l_i$. Prove
that $(v_n)_{n \ge 0}$ is equal to the Tribonacci word.
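
The coding of Problem 10.8.6 can be observed numerically on the
$\beta$-integers themselves. The Python sketch below is an experiment under
stated assumptions: it enumerates the $\beta$-integers by evaluating in
$\beta$ the greedy representation of $n$ in the base $1, 2, 4, 7, \dots$, it
takes the three gap lengths to be $1$, $\beta - 1$ and $\beta^2 - \beta - 1$
(a normalisation chosen for this example, which may differ from the lengths
$l_1, l_2, l_3$ obtained geometrically by a scaling), and it uses floating-point
arithmetic. The function names are hypothetical.

```python
def tribonacci_root(iterations=200):
    """Real root beta of X^3 = X^2 + X + 1, via beta = 1 + 1/beta + 1/beta^2."""
    b = 2.0
    for _ in range(iterations):
        b = 1.0 + 1.0 / b + 1.0 / (b * b)
    return b

def tribonacci_word(length):
    morphism = {"1": "12", "2": "13", "3": "1"}
    word = "1"
    while len(word) < length:
        word = "".join(morphism[c] for c in word)
    return word[:length]

def greedy_digits(n):
    """Digits (eps_0, eps_1, ...) of the greedy representation n = sum eps_i T_i."""
    T = [1, 2, 4]
    while T[-1] <= n:
        T.append(T[-1] + T[-2] + T[-3])
    digits = [0] * len(T)
    for i in range(len(T) - 1, -1, -1):
        if T[i] <= n:
            digits[i] = 1
            n -= T[i]
    return digits

def beta_integer(n, beta):
    """n-th beta-integer: the greedy digits of n evaluated in base beta."""
    return sum(e * beta**i for i, e in enumerate(greedy_digits(n)))

def code_gaps(count):
    """Word v_0 ... v_(count-1) coding the gaps between consecutive beta-integers."""
    beta = tribonacci_root()
    lengths = {"1": 1.0, "2": beta - 1.0, "3": beta**2 - beta - 1.0}
    points = [beta_integer(n, beta) for n in range(count + 1)]
    word = []
    for n in range(count):
        gap = points[n + 1] - points[n]
        # pick the letter whose assumed length is closest to the observed gap
        word.append(min(lengths, key=lambda letter: abs(lengths[letter] - gap)))
    return "".join(word)

print(code_gaps(20))
print(tribonacci_word(20))
```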


Notes 


For general references on substitutive sequences and substitutive dynamical sys- 
tems, see for instance Queffélec 1987 and Fogg 2002. 

The examples in Section 10.1.4 are famous. For more on Sturmian words, one 
can read for example Lothaire 2002 and Fogg 2002. For more on the Thue—Morse 
word, its history, and its many occurrences in the literature, see for example 
Allouche and Shallit 1999. The Rudin-Shapiro word was first introduced in 
Shapiro 1952. For all these words and for the paperfolding word one can read 
the notes of Allouche and Shallit 2003. 

In Section 10.2.5, Lemma 10.2.13 is a classical result in Perron—Frobenius 
theory (see for example Gantmacher 1959). The main theorem of Section 10.2.5 
(Theorem 10.2.15) is due to Cobham 1972. 

The main theorem of Section 10.3.3 (Theorem 10.3.4) was proved in Chris- 
tol 1979, see also Christol, Kamae, Mendes France, and Rauzy 1980. More 
generally, it is also possible to give a simple combinatorial characterization of 
primitive substitutive sequences (see Durand 1998, Holton and Zamboni 1999). 

The first proof of Theorem 10.4.2 in Section 10.4 is due to Wade 1941. The 
proof we give here is adapted from a proof given in Allouche 1990. 

The proof of Proposition 10.5.1 that we give in Section 10.5 comes from
Allouche 1997. The first proof was given in Petersen 1994 and Petersen 1996.
For a proof of the theorem of Chomsky-Schützenberger, see Chomsky and
Schützenberger 1963.

The theorem of Ridout given without proof in Section 10.6 was given in 
Ridout 1957. Corollary 10.6.2 is due to Ferenczi and Mauduit 1997. Theo- 
rem 10.6.3 is also due to Ferenczi and Mauduit 1997 under a more general form. 
A slightly more precise result in the case of binary alphabets is given in Allouche 
and Zamboni 1998. For more results on the transcendence of “automatic” real 
numbers, see for example Allouche and Shallit 2003. 

We do not claim exhaustivity in our choice of applications of the
Rauzy fractal in number theory. We have chosen the most representative
properties, which also motivated G. Rauzy in his study of the Tribonacci word
in Rauzy 1982. All the results of Section 10.8 follow closely the approach of
the seminal paper Rauzy 1982, from which come the proofs of Theorem 10.8.16,
Lemma 10.8.10 and 10.8.13, as well as the introduction of the matrix $B$,
whereas the proof of Theorem 10.9.1 is due to Chekhova, Hubert, and Messaoudi
2001. The fact that the vector $(1/\beta, 1/\beta^2)$ is badly approximable by
the rationals (Assertion 1 of Theorem 10.9.1) is a classical statement for
elements of a totally real number field (see for instance Cassels 1957).


Arnoux–Rauzy words. The Tribonacci translation first occurred in Arnoux 1988,
where the Tribonacci morphism was used to model an interval exchange map of 6 
intervals and to build explicitly a continuous and surjective conjugacy between 
this interval exchange map and the Rauzy translation (see also Arnoux and 
Yoccoz 1981); these results have led to the introduction of the family of Arnoux- 
Rauzy words in Arnoux and Rauzy 1991, to which the Tribonacci word belongs, 
as a generalization of the family of Sturmian words. 

Arnoux–Rauzy words are defined as the one-sided words $x$ with complexity
$P(x, n) = 2n + 1$ for all $n$ which are recurrent and which have for every length a
unique right special factor and a unique left special factor, each of these special 
factors being extendable in three different ways. Let us note that they can be 
similarly defined over any alphabet of larger size, say d; one thus obtains infinite 
words of complexity (d—1)n+1. Contrary to the Sturmian case, these words are 
not characterized by their complexity function any more. For instance, codings 
of non-degenerated three-interval exchanges have also complexity 2n+1. Let us 
observe that Arnoux—Rauzy words can be described as exchanges of six intervals 
of the unit circle (Arnoux and Rauzy 1991). 

The combinatorial properties of the Arnoux—Rauzy words are well-under- 
stood and are perfectly described by a two-dimensional continued fraction algo- 
rithm defined over a subset of zero measure of the simplex introduced in Arnoux 
and Rauzy 1991, Risley and Zamboni 2000, Zamboni 1998 and in Chekhova 
2000. By using this algorithm, one can express in an explicit way the probabil- 
ities of occurrence of factors of given length (Wozny and Zamboni 2001), one 
can count the number of all the factors of the Arnoux—Rauzy words (Mignosi 
and Zamboni 2002), or prove that the associated dynamical system has always 
simple spectrum (Chekhova 2000). See also Castelli, Mignosi, and Restivo 1999, 
and Justin 2000 for the connections with a generalization of the Fine and Wilf’s 
theorem for three periods. The family of Arnoux—Rauzy words has been itself 
extended to the family of episturmian words (Justin and Pirillo 2002b, 2002a). 


Rauzy fractal. The study of the topological properties of the Rauzy fractal is
mainly due to Rauzy 1982, 1988, where the Rauzy fractal $\mathcal{R}$ is shown
to be connected with simply connected interior (and so are the three pieces of
the Rauzy fractal $\mathcal{R}_i$, $i = 1, 2, 3$). See also Messaoudi 1998 and
Messaoudi 2000a for a parametrisation of its boundary, the points which have
several expansions being studied in detail (see also Remark 10.8.9). For a
study of its fractal boundary, see Ito and Kimura 1991, where it is proved to
be a Jordan curve generated by Dekking’s fractal generation method (Dekking
1982), from which a computation of its Hausdorff dimension is deduced.

Theorem 10.8.16 states that the translation
$v \mapsto v + (1/\beta, 1/\beta^2)$ on $\mathbb{T}^2$ can be coded using the
Tribonacci morphism. In dynamical terms, this theorem extends to the fact that
the symbolic dynamical system generated by the Tribonacci word is
measure-theoretically isomorphic to a translation of the torus $\mathbb{T}^2$,
the isomorphism being a continuous onto map. Furthermore it is also possible to
construct a Markov partition for the toral automorphism of $\mathbb{T}^2$ of
matrix given by the incidence matrix of the Tribonacci morphism, this
construction being based on the Rauzy fractal.

More generally, it is possible to associate a generalized Rauzy fractal to any
Pisot unimodular morphism (see Problem 10.8.2). (A morphism is said to be
unimodular if the determinant of its incidence matrix equals $\pm 1$.) There are several
definitions associated with several methods of construction for such Rauzy frac- 
tals. We have given here a definition based on formal power series inspired by 
the seminal paper Rauzy 1982, by Messaoudi 1998, 2000a, and by Canterini 
and Siegel 2001a, 2001b. A different approach via iterated function systems 
and generalized substitutions has been developed following ideas from Ito and 
Kimura 1991, and Arnoux and Ito 2001, Sano, Arnoux, and Ito 2001. Indeed, 
Rauzy fractals can be described as the attractor of some graph iterated function 
system (IFS), as in Holton and Zamboni 1998 where one can find a study of 
the Hausdorff dimension of various sets related to Rauzy fractals, and as in Sir- 
vent 2000a, 2000b, Sirvent and Wang 2002 with special focus on the self-similar 
properties of Rauzy fractals (see Lemma 10.8.7 and Problem 10.8.2). For more 
details on both approaches, see Chap. 7 and 8 of Fogg 2002. Both methods 
apply to unimodular morphisms of Pisot type. 

More generally, for any unimodular morphism of Pisot type the measure- 
theoretical isomorphism with a translation on the torus (or equivalently the 
existence of a periodic tiling of the plane by the Rauzy fractal) is conjectured 
to hold. A large literature is devoted to this question, which is surveyed in Fogg 
2002, Chap. 7. Inspired by Bedford 1986, Ito and Ohtsuki 1993 extend Rauzy’s
approach in order to produce Markov partitions for toral automorphisms pro- 
duced by the modified Jacobi-Perron algorithm. See also Praggastis 1999. 

In particular, Arnoux—Rauzy words which are fixed points of primitive mor- 
phisms (which are thus unimodular and of Pisot type following Arnoux and Ito 
2001) also generate symbolic dynamical systems which are measure-theoretically 
isomorphic to toral translations. It was believed that all Arnoux—Rauzy words 
originated from toral translations, and more precisely, that they were natural 
codings of translations over $\mathbb{T}^2$. This conjecture was disproved in Cassaigne,
Ferenczi, and Zamboni 2000. 


Tribonacci numeration system. The Tribonacci numeration system is the
canonical numeration system associated with the positive root $\beta$ of
$X^3 = X^2 + X + 1$. More generally, for a given $\beta > 1$, one can expand
real numbers in $[0, 1]$ as powers of the number $\beta$ using the greedy
algorithm: $x = \sum_{k \ge 1} b_k \beta^{-k}$, with some conditions on the
nonnegative integers $b_k$ (see also Problem 10.8.4); such expansions are
called $\beta$-expansions and are generated by the $\beta$-transformation
$x \mapsto \beta x - \lfloor \beta x \rfloor$ which also generates as a
dynamical system the $\beta$-shift (for more details, see for instance
Lothaire 2002). One can also represent natural integers in a base given by an
infinite sequence of integers (which generalizes Lemma 10.7.1) canonically
associated with the $\beta$-numeration: the set of factors of greedy
representations of natural integers in this base and the factors of the
$\beta$-shift are the same. Similar compact sets with fractal boundary are
considered as geometrical representations of the $\beta$-shift when $\beta$ is
a Pisot unit, in Thurston 1989, in Akiyama 1999 and in Praggastis 1999, where
topological or tiling properties such as Proposition 10.8.12 or Lemma 10.8.14
are studied in connection with the so-called F-property (Frougny and Solomyak
1992) (see also Problem 10.8.5). Generalized Rauzy fractals issued from the
$\beta$-numeration are also closely related to canonical number systems (see
for instance Akiyama and Pethő 2002).

There are also some close connections between the dynamical properties of 
the Rauzy fractal and the extension of the Fibonacci multiplication (introduced 
in Knuth 1988) to the Tribonacci recurrence relation, as studied for instance in 
Arnoux 1989 and Messaoudi 2000b, 2002. 

Rauzy fractals can be used to characterize the numbers that have a purely
periodic $\beta$-expansion, producing a kind of generalized Galois theorem on
classical continued fractions. It is known following Schmidt 1980 and Bertrand
1977 that elements of $\mathbb{Q}(\beta)$ have an ultimately periodic expansion
when $\beta$ is a Pisot number. A characterization of those points having an
immediately periodic expansion is given in Sano 2002, see also Ito and Sano
2001, by introducing a realization of the natural extension of the
$\beta$-transformation acting on the associated generalized Rauzy fractal for
$\beta$ being a Pisot unit which is a simple $\beta$-number. See also Ito 2000
(and more generally Gambaudo et al. 2000) for closely related results for
elements of cubic fields. Let us observe furthermore that the results of
Section 10.9 can be extended following the same ideas to other cubic numbers
(Ito 1996, Ito, Fujii, and Yasutomi 2003). Such results can also be proved
using algebraic geometry following Adams 1969.

Rauzy tilings have also been studied in theoretical physics and quasicrystal
theory in Vidal and Mosseri 2000, 2001 as outlined in Section 10.8.6, where we
have followed the terminology of Moody 1997. For more on mathematical
quasicrystals, see for instance Baake and Moody 2000. See also Burdik et al.
1998, 2000, Verger Gaugry and Gazeau 2004 for connected results in the
framework of beta-numeration.